This topic has grown on me over the years as I have seen shader code on slides at conferences, by brilliant people, where the code could have been written in a much better way. Occasionally I hear a “this is unoptimized” or “educational example” disclaimer attached to it, but most of the time this excuse doesn't hold. I sometimes sense that the author may use “unoptimized” or “educational” as an excuse because they are unsure how to make it right. And then again, code shipping in SDK samples from IHVs isn't always doing it right either. When the best of the best aren't doing it right, we have a problem as an industry.
(x - 0.3) * 2.5 = x * 2.5 + (-0.75)
Assembly languages are dead. The last time I used one was 2003. Since then it has been HLSL and GLSL for everything. I haven't looked back. So shading has of course evolved, and it is a natural development that we are seeing higher level abstractions as we're moving along. Nothing wrong with that. But as the gap between the hardware and the abstractions we are working with widens, there is an increasing risk of losing touch with the hardware. If we only ever see the HLSL code, but never see what the GPU runs, this will become a problem. The message in this presentation is that maintaining a low-level mindset while working in a high-level shading language is crucial for writing high performance shaders.
This is a clear illustration of why we should bother with low-level thinking. With no other change than moving things around a little and adding some parentheses we achieved a substantially faster shader. This is enabled by having an understanding of the underlying HW and the mapping of HLSL constructs to it. The HW used in this presentation is a Radeon HD 4870 (selected because it features the most readable disassembly), but most of the content in this slide deck is general and applies to any GPU unless stated otherwise.
Hardware comes in many configurations that are balanced differently between sub-units. Even if you are not observing any performance increase on your particular GPU, chances are there is another configuration on the market where it makes a difference. Reducing ALU utilization from, say, 50% to 25% while bound by something else (TEX/BW/etc.) probably doesn't improve performance, but it lets the GPU run cooler. Alternatively, with today's power-budget-based clocks, it could let the hardware maintain a higher clock rate than it otherwise could, and thereby still run faster.
Compilers only understand the semantics of the operations in the shader. They don't know what you are trying to accomplish. Many possible optimizations are “unsafe” and must thus be done by the shader author.
This is the most trivial example of a piece of code you may think could be optimized automatically to use a MAD instruction instead of ADD + MUL, because both constants are compile-time literals and overall very friendly numbers.
Turns out fxc is still not comfortable optimizing it.
The driver is bound by the semantics of the provided D3D byte-code. Final code for the GPU is exactly what was written in the shader. You will see the same results on PS3 too, except in this particular case it seems comfortable turning it into a MAD. Probably because of the constant 1.0f there. Any other constant and it behaves just like PC here. The Xbox360 shader compiler is a funny story. It just doesn't care. It does this optimization anyway, always, even when it obviously breaks stuff. It will slap things together even if the resulting constant overflows to infinity, or underflows to become zero. 1.#INF is your constant and off we go! Oh, zero, I only need to do a MUL then, yay! There are of course many more subtle breakages because of this, where you simply lose a whole lot of floating-point precision due to the change and it's not obvious why.
We are dealing with IEEE floats here. Changing the order of...