Wrapping Paper
I first tinkered with SH wrap shading (as described in part 1) for Splinter Cell: Conviction, since we were using a couple of models [1][2] for some character-specific materials. Unfortunately, due to the way that indirect character lighting was performed, it would have required additional memory that we couldn’t really justify at that point in development. Consequently, this work was left on the cutting room floor and I only got as far as testing out Green’s model [1].
Recently, however, I spotted that Irradiance Rigs [3] covers similar ground. At the very end of the short paper, they briefly present a generalisation of Valve’s Half Lambert model [2] and the SH convolution terms for the first three bands:
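In my notation, and taking the model to be $f_w(x) = \left(\frac{\max(x+w,\,0)}{1+w}\right)^{1+w}$ – a reconstruction, but one that recovers Lambert at $w = 0$, Half Lambert at $w = 1$ and the factorisations used later in this post – the convolution terms for the first three bands work out as (up to the usual per-band convolution constants):

$$\mathbf{f} = \frac{1+w}{2+w}\left(1,\ \frac{2}{3+w},\ \frac{w^2 - 2w + 3}{(3+w)(4+w)}\right)$$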
This tidily combines the tunability of [1] with the tighter falloff of [2], albeit at the cost of a few extra instructions in the case of direct lighting. It’s not energy-conserving though, so for kicks I went through the maths – see appendix – and made the necessary adjustments:
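Dividing through by the constant band, $f_0 = \frac{1+w}{2+w}$, normalises the overall reflectance (any remaining global scale can be folded into the lighting coefficients), leaving:

$$\hat{\mathbf{f}} = \left(1,\ \frac{2}{3+w},\ \frac{w^2 - 2w + 3}{(3+w)(4+w)}\right)$$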
I would suggest this as a good workout if your calculus skills are a little on the rusty side; think of it as a much-needed trip to the maths gym: sure it’s going to hurt at first, but you’ll feel better afterwards!
The same authors have since written a more in-depth paper, Wrap Shading [4], which Derek Nowrouzezahrai has kindly made available here. I recommend checking it out, since there’s some nice analysis and plenty of background information. One notable insight is that their model is perfectly represented by 3rd-order SH when $a = 1$ (i.e. Half Lambert). This becomes clear when you consider that the model is effectively unclamped in that case, so appropriate scaling of the constant, linear and quadratic bands will match the function exactly: $\left(\frac{x+1}{2}\right)^2 = \frac{1}{3}P_0(x) + \frac{1}{2}P_1(x) + \frac{1}{6}P_2(x)$.
A similar observation can be made with Green’s model: it’s perfectly represented by 2nd-order SH when $a = 1$, since the unclamped model is then simply linear in $\cos\theta$.
Shrink Wrap
But wait, at the end of part 1, didn’t I promise that there would be a discussion of optimisation in this post? You’re quite right. Well, it just so happens that a snippet of reference shader code from this last paper makes for a neat little case study on improving shader performance.
Reference Version
This is pretty much the reference implementation for generating the normalised convolution terms of their generalised model:
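The original HLSL listing isn’t reproduced here; as a stand-in, this scalar C sketch follows the same structure, assuming the generalised model $\left(\frac{x+w}{1+w}\right)^{1+w}$ with fA playing the role of $w$:

```c
#include <assert.h>
#include <math.h>

/* Unnormalised convolution terms f for bands 0..2, followed by an explicit
 * divide by the normalisation term. Here t0 = 2*(fA + 1) is also the band-1
 * numerator, hence the slightly odd 'fA*fA - t0 + 5' (== fA*fA - 2*fA + 3). */
void conv_terms_ref(float fA, float f[3])
{
    float t0 = 2.0f*fA + 2.0f;
    float d1 = fA + 2.0f;
    float d2 = d1*(fA + 3.0f);
    float d3 = d2*(fA + 4.0f);

    f[0] = (fA + 1.0f)/d1;
    f[1] = t0/d2;
    f[2] = (fA + 1.0f)*(fA*fA - t0 + 5.0f)/d3;

    /* normalise so that the constant band is 1 */
    float norm = 1.0f/f[0];
    f[0] *= norm; f[1] *= norm; f[2] *= norm;
}
```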
The only thing that I’ve changed – beyond adding calling code – is to pass in the wrap parameter fA from the vertex shader. It was previously a user-supplied constant, which doesn’t make for a particularly credible example, since in that case all of the maths could simply be moved to the CPU and performed just the once!

Note that there’s been some attempt to pull out common terms, particularly for the final component, where instead of fA*fA - 2*fA + 3 (see $\mathbf{f}$) we now have fA*fA - t.x + 5.
Without further ado, let’s see how this stacks up in terms of ps_3_0 instructions:
Ouch! 16 is fairly substantial, but perhaps not all that surprising going by the HLSL. Since this is device-independent assembly, I decided to check the ALU count on Xbox 360 for comparison. In that case it’s a somewhat more reasonable 10 operations, because 5 scalar ops get dual-issued with vector ops. So, in summary, we have:
DX9: 16, X360: 10(+5) ALU ops
Cancellation
Immediately, a simple but very effective change we can make is to cancel through by the normalisation term, which leaves us with $\mathbf{\hat{f}}$ directly:
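A scalar C sketch of the cancelled version, with the common $(1+fA)/(2+fA)$ factor divided out so that the terms come out band-0 normalised:

```c
#include <assert.h>
#include <math.h>

/* Normalised convolution terms, with the normalisation cancelled through */
void conv_terms_hat(float fA, float f[3])
{
    f[0] = 1.0f;
    f[1] = 2.0f/(fA + 3.0f);
    f[2] = (fA*fA - 2.0f*fA + 3.0f)/((fA + 3.0f)*(fA + 4.0f));
}
```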
Don’t expect the compiler to do intelligent optimisations like this; constant folding yes, factoring sometimes, sophisticated symbolic manipulation? Good luck!
For instance, even seemingly ‘obvious’ opportunities like (a/b)/(b/a) will go unnoticed by FXC. This isn’t down to the compiler trying to maintain special-case behaviour such as divide by zero either, because it will happily replace a/a with 1 in the absence of any knowledge about the value of a.
Apologies if that was already perfectly clear and all I’ve done is insult your intelligence, but I’ve seen some people blithely leave everything up to the compiler and not scrutinise what it’s generating. Of course, high-level algorithmic optimisations are hugely important as well, but so is this lower-level stuff when a shader is being executed for millions of pixels!
Just look at what this small amount of effort has netted us:
DX9: 10, X360: 5(+3) ALU ops
Factorisation
Next we can factor fA*fA - 2*fA + 3 again – this time as (fA + 1)*(fA + 3) - 6*fA – to reduce the numerator of the third term to a single multiply-add:
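A C sketch of the factorised version; the comments mark a plausible instruction breakdown:

```c
#include <assert.h>
#include <math.h>

/* Factorised version: t = fA + (1, 3, 4), with (fA+1)*(fA+3) - 6*fA
 * giving the band-2 numerator via a single multiply-add. */
void conv_terms_fact(float fA, float f[3])
{
    float tx = fA + 1.0f, ty = fA + 3.0f, tz = fA + 4.0f; /* add (vector)  */
    float uxy = tx*ty, uyz = ty*tz;                       /* mul (vector)  */
    f[0] = 1.0f;                                          /* mov           */
    f[1] = 2.0f/ty;                                       /* rcp, mul      */
    float num = uxy - 6.0f*fA;                            /* mad           */
    f[2] = num/uyz;                                       /* rcp, mul      */
}
```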
I’ve also taken the opportunity to manually vectorise the addition of fA, plus a subsequent pair of multiplications between resulting terms. In fact, the compiler does this anyway, as it’s relatively good at vectorising code. Still, one shouldn’t assume that it will always get things right!

Whether there’s a gain or not, manual vectorisation – which is often quick to do – makes it easier to sanity-check the output assembly. Just scanning through, you might expect add, mul, mov, rcp, mul, mad, rcp, mul, and you’d be pretty much spot on.
So, for DX9 we’ve reduced the op count by 2, but what about Xbox 360? Here, we’ve only succeeded in shaving off one paired scalar op. However, this may turn into a real gain once the function is part of a larger shader.
DX9: 8, X360: 5(+2) ALU ops
Rescaling
This next trick involves rescaling so that the second term becomes 1/t.y, or a single rcp:
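A sketch of the rescaled version, where cHalf is an externally supplied 0.5 (a hypothetical shader constant, folded out of the terms) so that band 1 collapses to a single reciprocal:

```c
#include <assert.h>
#include <math.h>

/* Rescaled by an external 0.5 (cHalf), so f[1] = 1/(fA + 3) is a bare rcp */
void conv_terms_rescaled(float fA, float cHalf, float f[3])
{
    float tx = fA + 1.0f, ty = fA + 3.0f, tz = fA + 4.0f;
    f[0] = cHalf;
    f[1] = 1.0f/ty;
    f[2] = (tx*ty - 6.0f*fA)*cHalf/(ty*tz);
}
```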
You might wonder why I’m using an external constant here. Well, it turns out that FXC will misoptimise when it knows the values. Bad compiler! Again, there’s one less paired scalar op on Xbox 360:
DX9: 7, X360: 5(+1) ALU ops
Expansion
Rather than factoring terms, we could have expanded $\mathbf{\hat{f}}$ instead:
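Concretely, partial fractions turn the third term into a pair of reciprocals, one of which is shared with the second term:

$$\frac{w^2 - 2w + 3}{(3+w)(4+w)} = 1 + \frac{18}{3+w} - \frac{27}{4+w}$$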
Or in code:
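A C sketch of the expanded version, using the half-scaled convention (band 0 is 0.5, as in the rescaled version), so band 2 is just two reciprocals and two multiply-adds:

```c
#include <assert.h>
#include <math.h>

/* Expanded via partial fractions:
 * f[2] = 0.5 + 9/(fA+3) - 13.5/(fA+4), sharing the rcp used for f[1]. */
void conv_terms_expanded(float fA, float f[3])
{
    float r3 = 1.0f/(fA + 3.0f);
    float r4 = 1.0f/(fA + 4.0f);
    f[0] = 0.5f;
    f[1] = r3;
    f[2] = 9.0f*r3 + 0.5f;     /* mad */
    f[2] = -13.5f*r4 + f[2];   /* mad */
}
```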
This is a win for ps_3_0 but not for Xbox 360, as it removes the opportunity for pairing. It’s possible that some clever variation could fix this, but it doesn’t matter because we haven’t exhausted our optimisation options…
DX9: 6, X360: 6 ALU ops
Fitting
There are potentially significant gains to be had from numerical fitting, so it’s worth taking the time to familiarise yourself with the various techniques, maths packages and libraries out there.
In this instance, I’m performing a cubic fit – i.e. $ax^3 + bx^2 + cx + d$ – for the 2nd and 3rd bands. Polynomials are attractive for performance because they can be efficiently evaluated as a series of mad instructions when written in Horner form: $x(x(ax + b) + c) + d$.
With careful vectorisation, this collapses to the following:
Xbox 360 does all this in one less operation because placing 1 into r.x can be achieved with a destination register modifier:
DX9: 4, X360: 3 ALU ops
I could present graphs showing how the cubic approximations fare, but take it from me that they are extremely close. In fact, we can arguably drop down to a quadratic fit and save a further mad in the process. This is still acceptable:
Figure 1: Comparison between original and quadratic fit for 2nd and 3rd bands (left, right)
In both cases – cubic and quadratic – I’ve actually constrained the fitting process so that the curves go through the endpoints. This reduces the worst case error a little and maintains the nice property of exactness when $a = 1$. Of course, something has to give and so the average error is a little higher.
In practice, this quadratic approximation has little effect on the end result. When lighting with a single directional source – a worstcase scenario – the difference is slight and far less significant than the error that comes from using 3rdorder SH in the first place.
Here’s the code for the quadratic version:
DX9: 3, X360: 2 ALU ops
Modifiers
And yet, we’re still not done! The DX9 figure suggests that we might pay the instruction cost of moving 1 into r.x with some GPUs, and although it could go away when the terms are actually used, it would be cute if we could get rid of it just in case.
Notice that the two curves are monotonically decreasing and within the range [0, 1]. If we negate the intermediate result of the first mad, saturate and then negate again, there will be no overall effect. By doing this, we can take r.x along for the ride and force it to 0 through one of the negative constants, then add 1 via the final mad:
Because saturation and negation are typically free register modifiers, we save an operation:
DX9: 2, X360: 2 ALU ops
Going Green
The Wrap Shading paper doesn’t include a normalised version of Green’s model (see part 1), so here’s code for that too:
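A C sketch of those terms. Assuming Green’s model $\max(x+w,\,0)/(1+w)$ normalised by its constant band, the maths works out very cleanly: $\hat{\mathbf{f}} = \left(\frac{1}{2},\ \frac{2-w}{6},\ \frac{(1-w)^2}{8}\right)$, which is one vector mad plus one mul:

```c
#include <assert.h>
#include <math.h>

/* Normalised convolution terms for Green's model:
 * (1/2, (2 - fA)/6, (1 - fA)^2 / 8), evaluated as
 * r = fA*(-1/6, -1/sqrt(8)) + (1/3, 1/sqrt(8)), then square r.y. */
void green_terms(float fA, float f[3])
{
    float rx = fA*(-1.0f/6.0f)   + 1.0f/3.0f;     /* (2 - fA)/6       */
    float ry = fA*(-0.35355339f) + 0.35355339f;   /* (1 - fA)/sqrt(8) */
    f[0] = 0.5f;
    f[1] = rx;
    f[2] = ry*ry;                                 /* (1 - fA)^2 / 8   */
}
```

Note that the band-2 term vanishes at fA = 1, which is the 2nd-order SH exactness mentioned earlier, and at fA = 0 the terms reduce to the familiar Lambert ratios (1/2, 1/3, 1/8).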
DX9: 2, X360: 2 ALU ops
Wrapping Up
Here’s a WebGL sample that encapsulates this miniseries on wrap shading.
In conclusion, shader optimisation is critical for video game rendering, so you shouldn’t defer to the compiler. To quote Michael Abrash: “The best optimizer is between your ears”. Don’t forget it, train it!
References
[1] Green, S., “Real-Time Approximations to Subsurface Scattering”, GPU Gems, 2004.
[2] Mitchell, J., McTaggart, G., Green, C., “Shading in Valve’s Source Engine”, Advanced RealTime Rendering in 3D Graphics and Games, SIGGRAPH Course, 2006.
[3] Yuan, H., Nowrouzezahrai, D., Sloan, P.P., “Irradiance Rigs”, SIGGRAPH Talk, 2010.
[4] Sloan, P.P., Nowrouzezahrai, D., Yuan, H., “Wrap Shading”, Journal of Graphics, GPU, and Game Tools, 15:4, 252–259, 2011.
Appendix
Normalisation factor for generalised Half Lambert:
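Assuming the model $f_w(x) = \left(\frac{\max(x+w,\,0)}{1+w}\right)^{1+w}$, the spherical integral reduces to a single power law after substituting $u = x + w$:

$$\int_{\Omega} f_w\,d\omega = 2\pi\int_{-w}^{1} \left(\frac{x+w}{1+w}\right)^{1+w} dx = \frac{2\pi}{(1+w)^{1+w}}\int_{0}^{1+w} u^{1+w}\,du = \frac{2\pi(1+w)}{2+w}$$

Dividing by the equivalent Lambert integral ($\pi$) gives the normalisation factor $\frac{2(1+w)}{2+w}$; equivalently, one can divide the convolution terms through by $f_0$.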