
AMD RDNA4 Architecture: Complete Overview and Analysis

AMD’s next-generation GPUs are nearly here, utilising the RDNA4 architecture, and while you’ve probably seen our previous video about what AMD needs to do to compete with NVIDIA, today we’re focusing specifically on what RDNA4 brings to the table technically. We’ll be looking at everything from architectural improvements to performance numbers and software features, and we’ll even make some educated price predictions.

Just the other day, we were invited to a briefing session with AMD to go through all of the information we can share with you today. Looking at AMD’s slides, we can see a clear progression from RDNA2 through RDNA3 and now to RDNA4. Each generation has brought significant improvements, but RDNA4 represents perhaps the most substantial architectural overhaul since the original RDNA lineup.

The most fundamental change appears to be in how the architecture handles resource allocation. RDNA3 used a static register allocation system, where registers would be reserved for specific workloads even when idle. RDNA4 introduces dynamic register allocation, which means those previously idle registers can now be repurposed on the fly. This allows for much more efficient utilisation of the silicon and helps explain some of the performance gains we’re seeing. Another major advancement is on the ray tracing hardware side.
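To illustrate the principle, here’s a deliberately simplified toy model in Python (not AMD’s actual hardware mechanism, and with entirely made-up register counts) showing why reserving each wave’s peak register usage up front strands more of the register file than reclaiming registers once they go idle:

```python
# Toy model only: this is NOT AMD's hardware mechanism, and every number here
# is made up purely to illustrate static vs. dynamic register allocation.

REGISTER_FILE = 768  # hypothetical registers available to one scheduler

# Each in-flight wave: (peak registers it ever needs, registers live right now)
waves = [(256, 96), (256, 96), (192, 64), (192, 64), (128, 128), (128, 32)]

def resident_waves(reserve):
    """Count how many waves fit when each wave reserves reserve(wave) registers."""
    used = count = 0
    for wave in waves:
        need = reserve(wave)
        if used + need <= REGISTER_FILE:
            used += need
            count += 1
    return count

static_fit = resident_waves(lambda w: w[0])   # reserve peak usage up front
dynamic_fit = resident_waves(lambda w: w[1])  # reserve only what is live now

print(f"Static allocation keeps {static_fit} waves resident")   # 3
print(f"Dynamic allocation keeps {dynamic_fit} waves resident")  # 6
```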

AMD has a sophisticated new implementation of oriented bounding box (OBB) optimisation. For those unfamiliar, this technique dramatically reduces the number of ray intersection tests needed when tracing complex scenes. The traversal heatmaps in the slides show significantly fewer ray tests being performed for the same scene, which translates to much better ray tracing performance.
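As a rough illustration of what an OBB test involves (this is the textbook approach, not AMD’s hardware implementation), the ray is transformed into the box’s local frame, where the check reduces to the familiar axis-aligned slab test. Because an oriented box can hug rotated geometry far more tightly than an axis-aligned one, fewer rays pass this cheap test and go on to trigger expensive triangle intersections:

```python
# Minimal sketch of a ray vs. oriented bounding box (OBB) test using the
# classic "transform into the box's local frame, then slab test" approach.
# Illustrative only; not AMD's hardware algorithm.
import numpy as np

def ray_hits_obb(origin, direction, centre, half_extents, rotation):
    """rotation is a 3x3 matrix whose columns are the box's local axes."""
    # Express the ray in the box's local coordinate frame.
    local_origin = rotation.T @ (origin - centre)
    local_dir = rotation.T @ direction

    t_min, t_max = 0.0, np.inf
    for axis in range(3):
        if abs(local_dir[axis]) < 1e-9:
            # Ray runs parallel to this pair of slabs: miss if it starts outside.
            if abs(local_origin[axis]) > half_extents[axis]:
                return False
        else:
            t1 = (-half_extents[axis] - local_origin[axis]) / local_dir[axis]
            t2 = ( half_extents[axis] - local_origin[axis]) / local_dir[axis]
            t_min = max(t_min, min(t1, t2))
            t_max = min(t_max, max(t1, t2))
            if t_min > t_max:
                return False  # the slab intervals no longer overlap
    return True

# Example: a box rotated 45 degrees about Z, tested against a ray along +X.
angle = np.radians(45)
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle),  np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
hit = ray_hits_obb(np.array([-5.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
                   np.zeros(3), np.array([1.0, 0.5, 0.5]), rot)
print("hit" if hit else "miss")  # expected: hit
```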

The compute architecture itself has seen substantial improvements. The biggest gains appear to be in specialized compute operations, with half-precision operations seeing a 2x improvement over RDNA3, while INT8, FP8, and the newly supported BF8 operations show an impressive 4x improvement.

This is particularly significant for AI workloads, which often rely heavily on these lower-precision operations. Digging deeper into the compute enhancements, we can see that AMD has maintained the same FP32 (standard single-precision floating point) performance at 256 operations per compute unit and the same FP64 (double precision) at 4 operations per CU. However, for the AI-relevant operations, the improvements are substantial. The FP16 and BF16 operations have increased from 512 to 1024 or 2048 operations per CU, depending on the workload. Perhaps even more impressive is the new support for FP8 and BF8 formats, which can achieve 2048 to 4096 operations per CU. The INT8 operations have dramatically improved from 512 to between 2048 and 4096 operations per CU, while INT4 has jumped from 1024 to between 4096 and 8192 operations per CU.
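Purely for reference, this little script just restates the per-CU figures quoted above and computes the scaling factors they imply, with the ranges reflecting the “depending on the workload” caveat:

```python
# Restating the per-CU peak throughput figures quoted above (operations per
# compute unit) and the RDNA3-to-RDNA4 scaling they imply.
ops_per_cu = {
    # format: (RDNA3, RDNA4 low end, RDNA4 high end)
    "FP32":      (256,  256,  256),
    "FP64":      (4,    4,    4),
    "FP16/BF16": (512,  1024, 2048),
    "FP8/BF8":   (None, 2048, 4096),  # newly supported formats on RDNA4
    "INT8":      (512,  2048, 4096),
    "INT4":      (1024, 4096, 8192),
}

for fmt, (rdna3, low, high) in ops_per_cu.items():
    if rdna3 is None:
        print(f"{fmt:>9}: new in RDNA4 at {low}-{high} ops/CU")
    else:
        print(f"{fmt:>9}: {rdna3} -> {low}-{high} ops/CU "
              f"({low // rdna3}x to {high // rdna3}x)")
```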


Peter Donnell

