In discussions of what a “perfect” display would be, the common view goes something like this:

Visual acuity is the limit of resolvable details. Therefore a perfect display would be at the resolution of visual acuity because the eye can not resolve the smaller details.

But recently, I have been coming to the conclusion that we should describe the problem in this way:

The human eye makes educated guesses based on incomplete information. Therefore, a perfect display needs enough resolution for our eye to make the same educated guesses, which may be incorrect.

Said another way:

Do we need enough resolution in our eye to reconstruct our display or enough resolution in the display to reconstruct our eye?

This leads to the real question:

Does every cone matter?

That is the heart of it. Does every single cone inside our eye do something that affects our vision? If so, a perfect display needs to be about 4x visual acuity in each dimension.

How steep is that edge?
Here is a more concrete example. Let us say that we have three cones in a row detecting if there is a dip in the signal. I.e. they are resolving a feature.

In this case, cone A records HIGH, cone B records LOW, and cone C records HIGH. From this information, the eye does not know how wide that dip is, only that it is smaller than the distance between A and C, or twice the spacing of the cones.

If you think about it, two things happen when resolving this feature: first the drop in level (from A to B), and then the rise (from B to C). But each of those transitions is interesting by itself.

Here are three more cases.

In these three cases, the signal is the same but shifted over by half of our cone spacing. All three cases have the same values for A and C, but B shifts. Is it plausible that the change in value from that single cone is enough to change our perception of the scene? My opinion: YES!

Note that the middle case is somewhat ambiguous. There are three different underlying signals that could cause the HIGH-MED-LOW sampling.

If we have only this information, we do not know if the slope of the drop matches the first, second, or third case. All that we know is that the falloff is thinner than twice our cone spacing. The length of the drop could be nearly 2x our cone spacing, 1x our cone spacing, or 0.01x our cone spacing.

But, if we have a 2d grid of sample points, we can use nearby rows to make an educated guess.

The middle row by itself does not have enough information to determine the width of the dropoff. It only knows that the dropoff is at most 2x the cone spacing. But the top and bottom rows both have enough information to determine that the width must be less than 1x the cone spacing, and our eye can use that information to fill in the middle row. Logically, our eye should be able to figure out that the width of the gradient is less than one cone spacing.

This case has a wider line that is a little over the spacing of one cone. Our eye should be able to figure out that this gradient is wider because two cones from each row fall inside the gradient, whereas in the sharper example zero or one cones from each row do.

Keep in mind that our reconstruction could be incorrect.

In this situation, the middle row is actually quite blurry and that gradient is wider than one cone spacing. Since this situation is unlikely to happen in the real world, our eye will likely guess it to be a consistent, hard edge. It is impossible to accurately reconstruct this signal with our given cone spacing. But whether our eye reconstructs the scene correctly or incorrectly is irrelevant. A perfect display would need to create a signal that causes our eye to make the same reconstruction mistakes that it would make in the real world.

Note 1: Vernier acuity takes this phenomenon a step further. By looking at the slope of the line across many rows, our eye can make an estimate of where the center of that line is and detect misalignment.

Note 2: Sometimes our eye guesses wrong. When we see a very thin, sharp line, our eye estimates it as a wider, softer line. If we want to simulate reality, we need a display with high enough resolution to stimulate our cones into making those same incorrect guesses.

Note 3: The world is not point sampled. Our cones have a physical size and the projected image is slightly out of focus, so a perfectly sharp line will always be slightly blurred. As long as our eyes know how much blur to expect, they should be able to determine that the falloff of the original line is less than the cone spacing.

Note 4: Does this actually happen? Not sure. But it seems plausible, and I have yet to see proof either way.

Edge Sharpness Acuity
Here is another interesting question: What is “Edge Sharpness Acuity”? It is not actually a scientific term–I just made it up. But what is the resolution where we can tell the difference between a sharp edge and a blurry edge? Has anyone actually studied this? It seems like the kind of thing that someone probably has studied but my google skills failed me. Keep in mind, the usual definition of “sharpness” is resolvable resolution, whereas I am talking about the slope of an edge.

Theoretically it should be possible to determine if the falloff of an edge is within 1x the spacing of our cones. If we can detect two edges within 2x the spacing of our cones to resolve a signal, then it seems plausible that we can detect a single edge within 1x of our cone spacing, especially given the nearby information from other cones.

Anecdotally, this seems like it might be true. When I look at a 4k display, it seems obviously better than a 2k display. But it could be that the 4k display had better content. Even though I can not resolve smaller points, it is possible that I can tell that the edges are “sharper”. Then again, maybe I am just the victim of confirmation bias because I want to believe that 4k displays matter. Maybe I was fooled when the sales guy at Best Buy told me that 4k displays have “30% more color”. He actually said that.

But if “Edge Sharpness Acuity” exceeds Visual Acuity, then we definitely can not fake this effect with supersampling on a lower res display. While edge aliasing can be detected at 5x to 10x the resolution of visual acuity, it seems like we could just render with high MSAA and solve the problem even with a display that is only at the resolution of visual acuity. But if the sharpness of the edge can be detected at higher than the resolution of visual acuity, the only way to fool our eye is with a higher res display.

Conclusions
This discussion gets me back to the original question:

Does every cone matter?

Those charts you have seen about the importance of resolution vs viewing distance usually assume that each cone DOES NOT matter, and that we only need displays to exceed visual acuity. But if every cone does matter and every cone does affect how our eye guesses at what the scene should be, then we need to go higher.

We can arrive at visual acuity by saying that the cones in our eye need twice the resolution of our screen. But is that backwards? If we reverse this problem and want to recreate the sampling for every single cone, then our display needs twice the resolution of the cone spacing in our eye.

So that is where 4x comes from. We need to double visual acuity to get back to cone spacing, and we need to double it again because our display (the sampler) needs twice the resolution of the ground truth surface (the cone spacing in our retina). To have a perfect display that is indistinguishable from real life, my conjecture is that the upper bound on screen resolution is 4x visual acuity in each dimension.

Said another way, here is that image again with three values for A, B, and C.

We need to have a display that is high enough resolution to create a signal that preserves A and C but allows us to twiddle the value of B. Based on Nyquist-Shannon, that signal would need to have twice the resolution of our cone spacing.

Finally, I need to make a few caveats:

  • I am making a very weak statement. I am not saying that we definitely need a display that has 4x visual acuity in each dimension. Just that we might need that much resolution, and we need to fully study hyperacuity before we can definitively say what the upper limit on resolution of a screen needs to be. In this post I am not trying to prove anything. Rather, I am explaining my conjecture that 4x visual acuity would be enough. But it is just that, a conjecture. Saying that the eye could theoretically act a certain way based on cone placement does not mean that it actually does.
  • 4x visual acuity might not actually be compelling. It is plausible that there are effects which we can see all the way up to that resolution, but those effects are unimportant and not worth the cost.
  • Even with a display at 4x visual acuity, we would still need anti-aliasing. Vernier acuity is from 5x to 10x visual acuity, so even on a display with such tightly packed pixels we would need anti-aliasing.
  • This discussion ignores temporal resolution. Our eyes are much more sensitive to crawling jaggies than static jaggies. Eyes are constantly moving (microsaccades). The two phenomena might be related. It is not impossible that our eye can somehow use temporal information, but I have not seen strong evidence either way. I find it hard to believe that temporal information from microsaccades can determine information about a scene that exceeds 4x visual acuity. My guess would be that the eye does not know exactly how far it has moved each “frame” and reconstructs the estimated position from the visual information. But I do not know of any studies that prove or disprove this possibility.

Finally, it seems like this phenomenon should have been studied by somebody. The perception of screen resolution should be important to many large electronics companies. So if you have seen any information that either proves or disproves this conjecture then please share!

What resolution does a display need to have before you can no longer perceive aliasing? The discussion usually goes something like this:

1. The human eye has a visual acuity of about 1 arc minute.
2. If a display has a resolution that exceeds 1 arc minute, it exceeds visual acuity.
3. Therefore, such a display exceeds the resolution of the human eye. We don’t need to worry about aliasing, and there is no use in having a higher resolution display because the eye can not perceive it.

This argument has a major logical flaw: Statement #3 is incorrect. “Visual Acuity” is completely different from “what the eye can see”, which is better captured by Hyperacuity. If you want a display that is so good that it exceeds everything that the eye can do, then you need to exceed Hyperacuity, not Visual Acuity.

Visual Acuity
First up, what is visual acuity? It’s the same concept as resolvable resolution. The image from the scene goes through the lens of the eye and gets projected on the retina.

The above image comes from the wikipedia page on Color Blindness. The left shows the distribution of foveal cones in normal vision and the right shows the distribution of a color blind person. But the important thing to note is the spacing of the cones.

By definition 20/20 vision means that the viewer can detect a gap in a line as long as that gap is at least 1 arc minute. Note: An arc minute is 1/60th of a degree. In other words, Visual Acuity is the same as resolvable resolution. Scientists have measured the physical spacing of the rods and cones in the eye (more importantly, the cones of the fovea) and people with visual acuity of 1 arc minute tend to have their cones spaced about half that distance apart.

Visual Acuity in practice is closely related to the Nyquist limit. With your cones spaced 0.5 arc minutes apart, you need three cones to determine that there is a gap. Two of the cones need to be “on”, and you need one to be “off” in between them. The smallest detail that can be resolved is the distance between the centers of those two “on” cones. A cone spacing of 0.5 arc minutes gives you visual acuity of 1.0 arc minutes, just as the Nyquist limit says.

Here is an example that I have used before. If you look closely at the crosses you should be able to see that some of the crosses are solid and others are made of small dots. Stand as far away from your computer screen as you can but still see the difference. Try to figure out the threshold where the dots in the crosses are not resolvable to your eye. At this distance, if this screen had 2x the resolution it would not matter because the screen resolution is already beyond the resolvable limit.

Edge Aliasing
Now look at this image. You should clearly be able to see the edges that have aliasing versus the ones that are smooth. Go back as far as you can from your screen until you can not see any difference. Try it.

If you are a human, you should be able to see the stair-stepping in the edges at a much greater distance than the one where you could resolve the dots in the crosses. So what is going on here? Both features are one pixel wide, but for some reason the edge aliasing feature is much stronger. It’s a phenomenon known as Vernier Acuity, and it is a form of Hyperacuity. In fact, we can detect slightly misaligned thin lines with 5x to 10x more precision than Visual Acuity.

What about signal theory?
How is this effect possible? Signal theory is very well studied, and it tells us that we should not be able to resolve features that are smaller than 1 arc minute (assuming we have 20/20 vision with cone spacing of 0.5 arc minutes). Yet, somehow we can detect aliasing well below that threshold.

Is signal theory wrong? Did Nyquist and Shannon make a mistake in the sampling theorem? Of course not. The problem is that the Nyquist-Shannon theorem has one major caveat: NO PRIOR KNOWLEDGE! The sampling theorem is based on the assumption that you are trying to reconstruct frequencies from samples and nothing else. But if you have prior knowledge about the structure of the scene, then you can determine information about your scene that exceeds the resolvable resolution of your samples.

Example #1: Subpixel Corner Detection
A trivial example is subpixel corner detection. If you have ever used OpenCV, you have probably calibrated a checkerboard like this.

In most cases you should be able to detect the location of the checkerboard corners with subpixel precision. Of course, without prior knowledge this would be impossible. But with prior knowledge (we know that it is a checkerboard with straight lines) we can accumulate the data and discover subpixel information about our scene.
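
For reference, here is a rough sketch of what that looks like with OpenCV (the file name and the 9x6 interior corner count below are placeholders):

#include <opencv2/opencv.hpp>
#include <vector>

// Find checkerboard corners to integer precision, then refine them to subpixel precision.
void FindSubpixelCorners()
{
    cv::Mat gray = cv::imread("calib.png", cv::IMREAD_GRAYSCALE);
    std::vector<cv::Point2f> corners;
    if (cv::findChessboardCorners(gray, cv::Size(9, 6), corners))
    {
        // The refinement works because of prior knowledge: each corner is the
        // intersection of two straight edges, so the corner can be slid to a
        // position that is more precise than the pixel grid.
        cv::cornerSubPix(gray, corners, cv::Size(11, 11), cv::Size(-1, -1),
            cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.01));
    }
}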

Example #2: Edge Aliasing
The Wikipedia article on Vernier Acuity has this nice image which in a way explains the crosses vs lines image above. The top two features demonstrate resolvable resolution. In order to determine that the two features are separated, we need to have a complete mosaic element (a cone) in between them to resolve that difference.

However, for the two features at the bottom, we can tell that they are misaligned. Note that due to imperfect focus, each foveal cone will return the approximate intensity of the blurred feature over that cone. From the relative intensities, the eye can determine where the center of the line passes through the mosaic pattern and figure out the misalignment with high accuracy. This phenomenon is called Vernier Acuity.
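
As a toy illustration of that idea, here is a quick numerical sketch. It is not a claim about how the retina actually does the math, but it shows the principle: an intensity-weighted centroid over a row of blurred samples can locate the center of a line far more precisely than the sample spacing.

#include <cstdio>

int main()
{
    // A dark line, blurred by the optics, centered at x = 2.3 in units of sample spacing.
    // These are the amounts by which five samples (at x = 0, 1, 2, 3, 4) are dimmed.
    double position[5]  = { 0.0, 1.0, 2.0, 3.0, 4.0 };
    double darkening[5] = { 0.016, 0.267, 0.932, 0.682, 0.105 };

    // Intensity-weighted centroid of the dimming.
    double sum = 0.0, weightedSum = 0.0;
    for (int i = 0; i < 5; i++)
    {
        sum += darkening[i];
        weightedSum += darkening[i] * position[i];
    }

    // Prints roughly 2.30: an estimate far finer than the 1.0 sample spacing.
    printf("estimated line center: %f\n", weightedSum / sum);
    return 0;
}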

Example #3: VR and Stereoscopic Acuity
If you work in VR then you should also be aware of Stereoscopic Acuity, which is another form of Hyperacuity. In the tests cited by this wikipedia article, the typical detectable binocular disparity is 0.5 arc minutes, or about half of visual acuity (1 arc minute). At a range of 6 meters, the detectable depth would be 8cm. If we want users to be able to perceive virtual worlds with the same depth precision as the real world, that likely means that we need to hit double the resolution of visual acuity. Getting the resolution of a VR display to match Visual Acuity is going to be hard enough, but we potentially need to quadruple that (2x in each dimension) to achieve a perception of depth that matches reality.

That said, it depends on the type of stimulus. In the article’s Other Measures section, stereo acuity for complex objects is similar to visual acuity. But for vertical rods it can apparently be as low as 2 arc seconds (1/30th of an arc minute). In other words, 30x finer than typical visual acuity. Call me crazy, but I don’t think we’ll be seeing VR displays with a resolution of 2 arc seconds any time soon.

Conclusion #1: “What the eye can see” greatly exceeds Visual Acuity
Hopefully it is clear that the eye can see features that exceed the limits of resolvable resolution. To completely fool the human eye, having enough resolution to exceed Visual Acuity is not sufficient. Rather, we need to have enough resolution to fool the heuristics and pattern matching capabilities of our complete visual system. To hit that threshold we need to exceed Hyperacuity, not Visual Acuity.

Conclusion #2: We still need AA in games?
Um, yes. It is not even close. If we want to have displays that are so good that we don’t need to worry about aliasing, then we need between 5x to 10x higher resolution than Visual Acuity.

Conclusion #3: What resolution is “enough”?
One counterargument is that we do not actually need to create high resolution displays to exceed the limits of hyperacuity. For example, if we were to create an image with perfect super-sampling then we should be able to downscale it and fix all of our aliasing issues. Would that resolution be good enough to fool our ability to detect aliasing? For VR, would exact (and expensive) supersampling allow us to achieve full stereoscopic acuity? Would that strategy allow us to exceed all forms of Hyperacuity in our visual system? Would the image still seem sharp? I’ll talk about it a little more in my next post. The answer: Definitely maybe.

But if we want to claim that a display with resolution X is enough to exceed “what the eye can see”, we need to prove it. We need to understand all the different forms of hyperacuity and have a reasonable explanation for how all those forms of hyperacuity can be recreated with such a display. We also need to verify these findings with clinical studies if we really, really want to be sure. But it is not acceptable to say resolution X exceeds Visual Acuity, sprinkle some pixie dust on it, and claim that it exceeds “what the eye can see”.

Here is an animation test for the latest Gonch test head. The animation comes from the optical mocap shoot that I did about a year and a half ago. The data was never meant to be used on a head like this so it was a bit of an adventure getting it to work!

This data was originally used for the GDC presentation I gave last year. Mocap Militia (mocapmilitia.com) delivered the original animation solved to the previous head rig. My solver then took the joint movement data and converted it into micro blendshapes.

This head rig is completely different. The Gonch head is a more usable blendshape rig with shapes roughly corresponding to FACS poses. So the joint animation data had to be transformed into blendshapes on a different head with a different topology. Ideally I would have liked to do a new shoot where the mocap vendor solves directly to this set of blendshapes. But as you all know, sometimes you have to make the best of the data that you have.

Of course, if you take optical motion capture data and solve it directly to blendshapes then you need talented animators to clean up and improve the data. Unfortunately, there was no budget for that so the solver was used as-is. There is about 8 minutes of data so given how quickly an animator works, you can run through the numbers on how much that would cost. So I had to roll up my sleeves and learn about Maya animation layers. The only changes to the data were to the blinks/squints and for the rigid transform of the head.

This model comes from a tall guy. Gonch is 6’9″, whereas Matthew Mercer (the voice actor, @matthewmercer) is normal height. Also, I’ve noticed that voice actors tend to have very flexible faces. Due to their different facial structures the solver is definitely choosing some incorrect lip poses around the mouth. In particular the corners of Matt’s lips move in very different ways than Gonch’s for the equivalent FACS poses. This problem would be fixed by solving the original mocap to the voice actor’s FACS poses, and then applying those FACS poses to the rig. But we never captured raw FACS poses of Matt so I had to settle for a pipeline that was duct-taped together using joint translations.

The Maya setup was very simple. Each clip has its own maya file. The head has both a skin deformer (for the neck) and a blendshape deformer with all the shapes. I also threw together a simple Maya python plugin to give me a control for the jaw as well as driving the eyelid movements based on the aim constraint of the eyes. The eye target was keyframed by hand based on the reference video. It is very exciting to finally test this head on real animation but there is a lot of room for improvement.

And if you would like a rig that looks like this one, just bring your talent to my studio and I’ll make one for you!

Why does the Uncanny Valley exist? Why is it so difficult to cross? Why is it that all the tricks we use for making CG things like rocks, houses, and cars do not work for faces? The usual answer is something along the lines of “Faces are hard” or “As human beings we are experts at understanding faces”. That just tells us the symptom, not the cause. What is the real, physiological reason why the Uncanny Valley exists?

In my opinion, we can understand this problem by looking at Visual Agnosia and Prosopagnosia. Let’s start with Visual Agnosia. Here is a quick video about a man named Kevin Chappell who has Visual Agnosia. Basically, he can’t understand objects. He can see the building blocks of objects such as lines, colors, and shapes. But he can not put it all together to recognize the objects. For example, he can see this thing that is long and silver but has no ability to recognize what it is. He has to use feel and context to realize that the object is a fork. However, he can somehow recognize faces.

I would recommend the whole video, but you should definitely see the parts:

  • 0m40s: Kevin discusses what he sees.
  • 2m41s: He can easily recognize faces in photos but can not make out the shapes on the vase.

However, Prosopagnosia is the reverse problem. More commonly known as “Face Blindness”, people with Prosopagnosia can make out objects clearly but are unable to recognize faces.

In the video, she performs a test at 1m00s to determine if she can recognize her mother by looking only at her face, but she can not do it. She can recognize her mother from her clothes but not from her face.

You can also take a test to see if you are faceblind:
https://www.faceblind.org/facetests/index.php

Average is 85% and I got a 78%. If you score below 50% you might be faceblind. It’s actually harder to recognize people than you think when there is no context behind it. In the beach volleyball crowd, it is a common occurrence for two people to start up a conversation in a bar and realize 5 minutes later that they already met while wearing volleyball gear. For better or worse, most people recognize me though because I’m the only 6’4″ volleyball-playing redhead in a 5 mile radius (context).

The point is, our visual cortex uses a completely different algorithm to process faces than it uses for all other objects. Many people think that our vision works like an LCD screen where the eye records a rectangle of pixels that gets sent to our brain. In reality, the data from the rods and cones gets passed to our retinal ganglion cells. The retinal ganglion cells perform contrast detection and send the data through our optic nerve to our visual cortex. The visual cortex applies simple shape detection along with movement tracking and color recognition. That “feature” data then gets sent on for semantic analysis. As always, wikipedia has a great article: Cognitive Neuroscience of Visual Object Recognition.

However, information about faces goes to the Fusiform Face Area (wikipedia article: http://en.wikipedia.org/wiki/Fusiform_face_area). The FFA is the area of the brain which, when damaged, causes Prosopagnosia.

In the computer graphics world we are pretty good at faking things. Good artists have been trained to analyze real world objects and try to create the most realistic looking facsimile using the minimum amount of time and resources possible. That includes both content creation time and rendering time. Since time is always a constraint, good artists have learned that we want to do the minimum amount of work possible to trick our visual system. In computer graphics, we are experts at fooling our visual cortex. That’s our one and only job.

But faces are handled by a completely different section of our brain: The Fusiform Face Area. My theory is that the Uncanny Valley feeling happens when the Fusiform Face Area has a mismatch with the rest of the visual system. Your visual system thinks the scene is real but the FFA is telling you that something is wrong. To solve this problem, all we need is a better understanding of how the FFA works. We need to find out what is and is not important to the FFA, and if we can do that then we should be able to solve the Uncanny Valley.

Going big picture for a second, we have evolved for millions of years to have this separate, special, dedicated area of brain functionality in the FFA. So it probably is not doing the exact same thing as the rest of our visual system. If it was doing the same thing it would not have evolved into a separate region.

For example, something about our FFA is hardwired to detect faces that are upright (as opposed to inverted). This is known as the Thatcher Effect (wikipedia).

The inverted images both look reasonable at a glance. If you look for a while you can probably use context to realize that the right one is a little bit off. But when you see them in the correct orientation:

It looks obviously, terrifyingly wrong. That image demonstrates your FFA rejecting the image which puts it in the Uncanny Valley. The point is that the FFA is not doing the same thing as the rest of your visual system with higher quality. Rather, the FFA is using fundamentally different algorithms than the rest of your visual cortex.

I don’t think about the Uncanny Valley as making “Higher Quality Faces”. Rather, I think about it in terms of “FFA Rejection”. Our CG faces do not necessarily need to be more accurate or more detailed. Instead we need to figure out which triggers cause the FFA to reject the image. Then all we have to do is fake those triggers.

So that leads to the question: What should we do? We could just make all our faces upside down but your art director will probably veto that idea. Why is the FFA rejecting our CG images? What is the missing thing? You probably already know my answer: It rhymes with “Stud Toe”.

One of the problems that has been bothering me for several years is that it seems like there should be a way to speed up the Fresnel and Visibility functions by combining them together. In particular, Schlick-Fresnel and the Schlick-Smith Visibility term have a similar shape so I’ve done some experiments to combine them together. The results are above with the reference version on the left and the approximation on the right.

To give you an idea of the cost, the per-light code is below and I’ll explain the derivation. The preCalc parameter is derived from roughness and the poly parameter is either from a 2D lookup texture indexed by F0 and roughness or derived from a less accurate endpoint approximation.

The important thing to note is the FV line. It turns out that if we have a known F0 and Schlick-Smith roughness parameter, the combined function of FV can be modeled with an exponential of the form:

F(x) = exp2(Ax^2 + Bx + C)

Finally, the full shader source, the example DDS files, and the solver source are included at the bottom of the page under a public domain license.

1) Motivation.
As you should already know, the most common formula for Fresnel is the Schlick approximation which is:
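
F(dotLH) = F0 + (1 - F0) * pow(1 - dotLH, 5)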

And the Schlick-Smith visibility term uses code along these lines (shown in one common form; conventions for k and constant factors like the 1/4 vary between implementations):
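
k = alpha / 2
G1(dotNX) = 1 / (dotNX * (1 - k) + k)
Vis = G1(dotNL) * G1(dotNV)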

At a glance it should not be possible to combine them together. After all, Schlick Fresnel is a function of dot(H,V) whereas visibility is a function of dot(N,L) and dot(N,V). But almost a decade ago I was looking at the MERL BRDF database. The MERL data models isotropic BRDFs using a 3D function of dot(H,V), dot(H,N), and phi (the angle between the projections of L and V). I did some testing back in 2006-ish and found that you could remove the phi term for most “simple” materials, and just use a 2D function of dot(H,V) and dot(H,N).

That inspired me to write the post last year about combining F and V together by only using dot(H,L), which is of course the same as dot(H,V). The main idea was to create a 2D texture where you could look up the lighting function by doing two texture reads per light. Unfortunately it does not help very much because you generally want to avoid dependent texture reads inside your lighting function. You could also use the optimized analytic formula, which might save a few cycles because you can ignore the dot(N,L) term. But that does not save much either because you can calculate that value once at the beginning of your shader and reuse it for each light.

But the next inspiration was this post from Sébastien Lagarde regarding spherical gaussian approximations. That post covers how you can approximate the function:

F(x) = pow(1-dotLH,5)

…with the function…

F(x) = exp2(Ax^2 + Bx + C)

…where A and B are derived from an optimization procedure and C is zero. That got me thinking: any function with the same basic shape should have a good approximation using 2 to the power of a polynomial. Since the Schlick-Smith GGX Visibility function has a similar look, we can probably take the combined Fresnel and Visibility function and fit to it. From here I tried several approaches that all seem to get good results with minor differences.

2) Endpoint Matching
The first idea is to just get the endpoints to line up. The combined FV function is at a minimum when the light and viewer are at the same position and dot(L,H)=1. In this case, Fresnel is F0, Visibility is 1, and the combined value is F0. The FV function hits its peak at grazing angles and dot(L,H)=0. Fresnel is 1, Visibility is some value which we can call V90, and the combined function is V90. Given some value dotLH, here are the endpoints for our combined FV function. We just need some reasonable function in-between.

FV(1) = F0
FV(0) = V90

We can make this happen with the following function:

x = 1-dotLH
A = log2(V90/F0)
B = 0
C = 0
FV = F0 * exp2(A*x*x + B*x + C)

C has to be zero because exp2(0)=1. With any nonzero C value our function will deviate from F0. We could optimize B though. As long as B+A=log2(V90/F0) we will hit the correct endpoint, but we will get to that later.

The one obvious issue when trying this function is that some roughness values are too bright. It turns out we do not necessarily want to exactly hit the V90 endpoint. The grazing angles at 80 or 85 degrees are much more visually important than at exactly 90. Thanks to the asymptotic nature of Schlick-Smith with low roughness values this function will overestimate the value at 80/85 to hit that very high V90 value. Going the other way, higher roughness values (like .5 or .6) are too dark. To solve these problems I created a function that tries to get a better perceptual fit for A based on roughness. The code is below as an optional tweak.

3) Linear Solver
There are two obvious problems with the endpoint approach. First, while the endpoints match exactly, the middle of the curve does not look quite right. Second, we could use B and C to get a better fit. We can address this using a linear solver.

Given our ground truth function FV(x), we are trying to solve for A, B, and C in the following function:

FV(x) = F0 * exp2(Ax^2 + Bx + C)

That function is nonlinear and requires a bit of work to solve. But with a divide and a log we can reduce the function to a simple polynomial:

log2(FV(x)/F0) = Ax^2 + Bx + C

That function can be easily solved with polynomial curve fitting. Note that while the original function would minimize the squared error, the adjusted function will minimize the squared log error. This is an unexpected benefit since the human eye’s perception of light is logarithmic anyways. Log error squared is probably a better choice than linear error squared.
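
As a sketch of what that fit looks like (this is generic unweighted least squares over the 3x3 normal equations, not the exact solver included in the zip):

// Fit y = A*x^2 + B*x + C to the samples (x[i], y[i]) by least squares,
// where y[i] = log2(FV(x[i]) / F0). Solves the 3x3 normal equations with Cramer's rule
// and assumes the samples make the system well-conditioned.
void FitQuadratic(const double* x, const double* y, int n, double* A, double* B, double* C)
{
    double m[3][3] = {};
    double v[3] = {};
    for (int i = 0; i < n; i++)
    {
        double basis[3] = { x[i] * x[i], x[i], 1.0 };
        for (int r = 0; r < 3; r++)
        {
            v[r] += basis[r] * y[i];
            for (int c = 0; c < 3; c++)
                m[r][c] += basis[r] * basis[c];
        }
    }
    double det =
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1]) -
        m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0]) +
        m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    double coeff[3];
    for (int k = 0; k < 3; k++)
    {
        // Replace column k with v and take the determinant ratio.
        double t[3][3];
        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
                t[r][c] = (c == k) ? v[r] : m[r][c];
        coeff[k] =
            (t[0][0] * (t[1][1] * t[2][2] - t[1][2] * t[2][1]) -
             t[0][1] * (t[1][0] * t[2][2] - t[1][2] * t[2][0]) +
             t[0][2] * (t[1][0] * t[2][1] - t[1][1] * t[2][0])) / det;
    }
    *A = coeff[0]; *B = coeff[1]; *C = coeff[2];
}

You would sample x over [0,1] (remember that x = 1 - dotLH), evaluate the reference FV at each sample, and feed in y = log2(FV/F0). The constrained variant simply fixes C = 0 and fits only A and B, and a weighted fit just scales each sample's contribution to m and v by its weight.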

There are several tweaks that we can perform to optimize the solver. The first question is if we want to constrain C or not. If we let C float around then we will get overall better error but F0 will change. Or we can constrain C and have more error in the middle angles. In my opinion, the constrained version gets a better result but feel free to experiment. The results will also depend on how you weight the samples. It’s simply not possible to get an exact fit so error will have to be somewhere. The problem is finding the least-worst place for that error to go.

Another tweak we can make to the solver is to limit the really strong grazing angles. For the low roughness values the very high grazing angles (88, 89, 90) tend to overpower the lower grazing angles so at roughness 0 we can clamp the angle at 85 degrees. That clamp function lerps so that by roughness .5 the largest angle is back at 90 degrees.

We can also optimize out a few instructions. Our full function looks like this:

x = 1 - dotLH
FV = F0 * exp2(Ax^2 + Bx + C)

First, we can get rid of that F0 term using the following rule:

A*exp2(B) = exp2(B + log2(A))

That removes F0 by adding log2(F0) to C. Also, if we are on a GPU where divide is more expensive than rcp(), we could bake preCalc.y into C as well and replace the divide with rcp().

Second, we can remove the 1-dotLH by plugging the values in and multiplying it out.

F(1-x) = Ax^2 + Bx + C
F(x) = Ax^2 + (-2A-B)x + (A+B+C)

Combining all those steps together allows us to calculate a combined FV value in a single exp2() and two fused multiply adds.
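
As a scalar sketch of that final form (the names here are mine, and the real version lives in the shader source in the zip):

#include <cmath>

// Offline: fold F0 and the (1 - dotLH) substitution into the fitted coefficients.
void AdjustCoefficients(double A, double B, double C, double F0,
                        double* A2, double* B2, double* C2)
{
    C += std::log2(F0); // F0 * exp2(p) = exp2(p + log2(F0))
    *A2 = A;            // substitute x = 1 - dotLH into Ax^2 + Bx + C and expand
    *B2 = -2.0 * A - B;
    *C2 = A + B + C;
}

// Runtime, per light: one exp2 and two multiply-adds.
double EvalFV(double dotLH, double A2, double B2, double C2)
{
    return std::exp2((A2 * dotLH + B2) * dotLH + C2);
}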

The sample code saves the data as a 128×128 texture for roughness and F0. The precision gets a bit low for F0 because dielectric values are between 0.02 and 0.05. So the y axis indexes to F0^2, not F0.

In the final code, you first would calculate your roughness and F0 value. Then apply several small tweaks based on your roughness value to save a few instructions when calculating D.

You would also have to do a table lookup to fetch your polynomial data. Note the sqrt() around F0.

From there you would use a small per-light function to accumulate your lighting and call it once per light. The exact shader code is in the zip at the bottom of the page.
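
As a rough CPU-side sketch of the per-light flow (this assumes a standard GGX distribution term and uses my own names; the real HLSL will differ in details like how preCalc packs its constants):

#include <cmath>

// GGX / Trowbridge-Reitz distribution term.
double EvalD(double dotNH, double alpha)
{
    double a2 = alpha * alpha;
    double t = dotNH * dotNH * (a2 - 1.0) + 1.0;
    return a2 / (3.14159265358979 * t * t);
}

// Per-light specular. A2, B2, C2 are the adjusted polynomial coefficients from the
// sketch above (fetched from the lookup texture or the endpoint approximation).
double EvalSpecularLight(double dotNL, double dotNH, double dotLH,
                         double alpha, double A2, double B2, double C2)
{
    if (dotNL <= 0.0)
        return 0.0;
    double fv = std::exp2((A2 * dotLH + B2) * dotLH + C2); // combined Fresnel * Visibility
    return dotNL * EvalD(dotNH, alpha) * fv;
}

You would multiply the result by the light color and add it into your accumulator once per light.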

4) Conclusion
That should be all the important details of the approximation. I’ve found that the best results come from the constrained solver version. When the dot(L,H) is near one as in this sample it looks nearly exact. You can click on the image for a larger version.

The middle is the reference GGX version, the right is the optimized version, and the left is the difference with levels applied.

When you get near grazing angles the results show a little bit of loss but are still quite acceptable. At grazing angles you can see a difference when you flip back and forth between the reference and the approximation. But if I was looking at a single version by itself I’m not sure if I would be able to tell if it was the reference or the approximation.

With a solver like this, it is interesting to think through how many levels of approximations we are doing. This FV function is not an approximation of the ground truth FV function. Rather, Schlick-Fresnel is an approximation of the ground truth Fresnel function. Schlick-Smith GGX Visibility is an approximation of the true GGX visibility function. So this FV function is an approximation of the product of an approximation and another approximation.

However since we are approximating the function numerically we can easily change the original function and re-solve. If you have a ground truth for both Fresnel and Visibility it would be interesting to solve for those together and see if the result is more accurate than multiplying the approximate Fresnel and approximate Visibility terms together.

We could apply the same approach for any function. There are quite a few functions in computer graphics which have a similar shape and might be well approximated by taking the exp2() of a polynomial. It is trivial to solve to a higher order polynomial if the situation calls for it.

The major downside of this approach is that you do not get F as an intermediary value. If you are conserving energy and want to multiply your diffuse by (1-F) then this approach will not work. It also will not work if your F0 is a color that lerps to white at F90. Technically it will work, but you will have to do it 3 times (one for each channel) which would probably be slower than the reference function.

That being said, there are times when you are really tight on cycles. Mobile systems are becoming graphically comparable to the PS3 and XB360. If we want something better than Blinn-Phong but can not afford full PBR functionality then this approximation might be a good tradeoff. Another application is VR rendering where you have to push a staggering number of pixels per second. High quality VR is going to be tricky even with the most expensive PC that money can buy.

Finally, here is the zip file.
GgxPolySolver.zip

It contains:

  1. Shader code.
  2. Four variations of the solved polynomial textures.
  3. C++ source for building these textures

The files are available under a public domain license so feel free to use them however you wish. Please try it out and let me know if you get any good results!

Are you looking for high quality heads? If so, I’m pleased to announce that after many long days and nights, I’m launching a 3d Facial Scanning service in LA for Video Games and VFX.

The primary advantage of the Filmic Worlds scanning service is that I have a specialized pipeline to create usable rigs. In short, you bring in the talent, and you get a rig that looks like this:

This scan is of Eric “Gonch” Goncharenko who very kindly agreed to lend us his face for this test. The video is from a DirectX 11 demo showing what the results should look like in an actual game (more realtime details at the bottom of the page).

With most scanning workflows you start with the “raw” scan which is usually the result of photogrammetry software. Then you have to manually clean each scan, align your base topology to the rig and sculpt every expression. Whereas with these heads, your delivery includes each shape solved to your base topology. Rather than using off-the-shelf software, this pipeline uses a custom stereo solver which will be discussed more in later posts. In other words, you get the base topology solved to each scan which you can import directly into Maya (or any other program).

A face shape is much more than just a mesh. The delivery also provides a displacement map, normal map, and diffuse map for every shape. So if you scan 100 shapes then the delivery will include 100 low-res meshes, 100 displacement maps, 100 normal maps, and 100 diffuse maps.

Of course, after solving your topology to the initial scan many more steps are required to create a usable rig. So there is a pipeline in place to perform several mesh cleanup tasks:

  1. Symmetry: Symmetry is tricky. If you are handling large volumes of data, it is important to add some element of symmetry or else you will go crazy. For example, you want the eyes to be the same height. But you need to preserve some asymmetry to retain the essence of the character. So this pipeline verifies that the contour of the eyelid and the interior edge loop of the lips are symmetric while preserving the other areas of the face as much as possible.
  2. Isolation: Each pose should only affect a certain area. A forehead expression should not move the jaw. This step is somewhat subjective, so I always try to err on the side of a region that is too large as opposed to too small.
  3. Alignment: During the scan the talent’s head will always be moving. Each pose needs to be tracked and aligned.
  4. Eyes Placement: The eye geometry needs to be placed in the eye socket.
  5. Eyelid Alignment: The eyelids need to align with the eyes. I do this by specifying an “alignment edge loop” for each eye and constraining it to the eye geometry.
  6. Teeth Placement: The service involves placing the teeth.
  7. Mouth Cavity: The mouth cavity needs to be adjusted to have enough space for the teeth and to prevent the back of the lips from going through the teeth.

These steps are all performed by the pipeline, and the final result is delivered to you.

Once you have this data, you will need to figure out how to animate it. Fortunately, the data is fundamentally just blendshapes. The low res meshes form a blendshape rig, and you can animate that rig using standard animation tools. To animate the textures, you can reuse the weights from the blendshapes. So if the smile mesh has a weight of 0.7, then you would also apply a weight of 0.7 to the smile’s diffuse, normal, and displacement maps.
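
In other words, the maps blend the same way the vertex deltas do. A minimal per-texel sketch, assuming all of the maps share a resolution and blend as deltas relative to the neutral pose (the function name is mine):

#include <vector>

// Blend a set of per-pose maps the same way the blendshape vertices are blended:
// result = neutral + sum_i( weight_i * (pose_i - neutral) ), evaluated per texel.
// 'neutral' and each entry of 'poses' are flat arrays of texel values.
std::vector<float> BlendMaps(const std::vector<float>& neutral,
                             const std::vector<std::vector<float>>& poses,
                             const std::vector<float>& weights)
{
    std::vector<float> result = neutral;
    for (size_t p = 0; p < poses.size(); p++)
        for (size_t t = 0; t < result.size(); t++)
            result[t] += weights[p] * (poses[p][t] - neutral[t]);
    return result;
}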

In my opinion, we need to do more work on animating the diffuse maps in our facial rigs. To show why, here is a video demonstrating the transition between smile and sneer with the diffuse map turned on and off. I highly recommend watching them on vimeo at a higher resolution.

The explanations for the shots in the video are:

  1. Full Animation: The full head rig with all texture animation enabled.
  2. Uncanny Animation: The same head, but with the diffuse map animation disabled. Note that you can still see a little bit of detail in the nasolabial fold since the animated AO is still on. This animation looks uncanny to me.
  3. Full Animation vs Static Diffuse: Shows a comparison of the animated diffuse map vs the neutral pose. Note the changes in redness around and above the nasolabial fold and the nose.
  4. Full Animation vs Static Geo: As another comparison, we can leave the diffuse map animation on but turn off the animated geometry to visually isolate the color differences.
  5. Full Animation vs Static Geo: Same as before but with full shading enabled.
  6. Full Animation vs Static Diffuse: The comparison between full shading with animated diffuse maps and a static diffuse map.

To get another perspective, here is the same video at 3x speed:

The point is that the diffuse map undergoes significant changes as the face animates, and this animation is more than just wrinkles. The primary changes seem to be related to blood flow and skin stretching. The approach I took for this demo is to have every pose include its own diffuse map. Another option is to create a fluid sim or “blood flow masks” and use the diffuse maps as reference. There are many options for using this kind of data.

Finally, here are some stats on the realtime demo. It was designed to fit within the constraints that a real game would have. The geometry is just blendshapes so it is compatible with any engine that supports them. The head geometry is not very heavy: about 7300 verts. The total texture memory is 55MB which is high but not crazy. In a real game you would probably sacrifice the largest mip which would bring it down to under 14MB. The raw data is gigabytes in size and PCA compression is essential to bring that size down.

The framerate depends on the viewport. On my NVIDIA GTX 660 Ti at 1080p the GPU renders in around 2.2ms when the head touches the top of the frame and the base of the mesh touches the bottom. Those numbers are based on DirectX 11 timestamp queries so I don’t really trust them, but they seem reasonable enough. Finally, the lighting is just a single directional sunlight with a shadow and an IBL for the ambient. The only thing in the demo which a real game probably can not do is 8x MSAA.

So that’s it. If you need to scan some actors and/or models please get in touch through email. Also, I’ll be at GDC Wednesday through Friday so if you have any questions I’d be happy to chat.

Good luck with your heads!!

The Uncanny Valley is something that we have all dealt with and/or thought about in computer graphics. It has become a sort of bogeyman used to scare CG artists and graphics programmers in the same way that monsters under the bed scare little children. “Don’t do that or the Uncanny Valley will get you!!!!” Like all things that we are irrationally scared of, we can put that fear to ease by really analyzing what it is, where we are, and what we can do about it. The TL/DR version of this post: Solving the Uncanny Valley will not be easy, but we shouldn’t be scared of trying.

What is the Uncanny Valley
If you really have not heard of the Uncanny Valley, then do a google search. The short version is that recreating human faces in CG is hard. A human likeness that is unrealistic (or cartoony) is easier to relate to. But as human likenesses get nearly real, they get creepy and hard to relate to. Then, once you get real enough, that likeness can be subconsciously accepted. The base image came from wikipedia and the licensing information is at the bottom of this page.

If you have not already, I highly recommend the Wikipedia article: http://en.wikipedia.org/wiki/Uncanny_valley. It has an interesting set of theories about why the Uncanny Valley might exist. Interestingly, the first theory listed is “Mate selection”. I.e. characters in the Uncanny Valley trigger our unconscious mate selection biases because these characters look like they have “low fertility, poor hormonal health, or ineffective immune systems”. As a side note, I’ve always thought that blood flow and other diffuse map changes do not get enough attention. I suppose it makes sense in that someone with no blood flow in their face probably has something very, very wrong with their immune system!

Where is the bottom?
While we talk about the Uncanny Valley as something that we should be afraid of, it seems like no one ever talks about where we actually are.

The chart below splits the valley into two sides, the descent and the ascent. If we are on the left side (red), then as we get more real we are going deeper and deeper into the horrors of the uncanny valley. In this section, everything that we do to make the characters look “better” will actually make the characters less relatable. On that side, we would make games better by giving up, going back, and looking more cartoony.

Then again, if we are on the ascent side (blue) then we have already bottomed out. We have already created the worst, creepiest, most unrelatable characters possible. To make our game look better we need to slowly and agonizingly increase the quality of our characters. But the goal is straightforward: We can make characters look better by making them look better. Simple, right? On this side, we have passed the days of “make it look worse to make it look better”.

In my opinion, video games in the top end graphically are somewhere on the ascending side. It seems like we really hit the bottom of the Uncanny Valley in the early Xbox 360/PS3 generation. Those games had the creepiest, most uncanny characters.

Since then we have been climbing our way up. I would put the best-looking XB1 and PS4 games somewhere in this area:

It is hard to say exactly where we are. But we are definitely going the right direction. There is still much work to do, but in my opinion we are much closer to the top of the valley than we are to the bottom.

How far do we have to go?
The other interesting question about the Uncanny Valley is “How far is good enough?”. How realistic do characters need to be to become non-creepy? Do they need to be indistinguishable from reality? In my opinion, no.

I have linked to this video before, and I will link to it again. It’s from an artist named Lukáš Hajka who created a DIY Ucap system. You can see images of how it was made on the Crossing the uncanny valley WIP thread over at cgfeedback.com. He also has a tumblr page showcasing his other work.

The idea is pretty simple: You capture video of your talent and solve for textures as well as the model. Then when you play it back, you animate the diffuse map on the face. Note that it was called UCap (for Universal Capture) but these days everyone refers to the concept as 4D Capture so I will stick with that terminology.

To me, this footage crosses the Uncanny Valley. It is definitely not “Photoreal”. You can tell that it is not perfect. The rendering is just the Maya viewport with no shading. But somehow it retains that essence of the person.

What I like about this video is that it shows the purity of the algorithm. There is no lighting. No normal maps. No AO. No skin shading. Just a skinned mesh with an animated diffuse map on top of it. There are similar videos that I’ve seen which were not publicly released and have been lost to time.

There are quite a few examples of similar approaches.

  • There is the original work on the Matrix sequels which was the first well-known successful commercial use of 4D data (that I am aware of). If you know of an earlier use, please let me know. Here is an awesome making of video, and you really should watch the whole thing: Universal Capture System (UCap) Reel. The Siggraph sketches are on George Borshukov’s webpage: www.plunk.org/~gdb/. It is hard to believe that this data is 12 years old!
  • After the Matrix sequels, George Borshukov led a team at EA (which I was on) to apply the same technique in realtime. Here is the best video that I could find (from the 2006 Playstation E3 Press Conference). www.youtube.com/watch?v=DZuMMevcjHo. The chapter in GPU Gems 3 is still online: http.developer.nvidia.com/GPUGems3/gpugems3_ch15.html.
  • LA Noire used a similar technology (powered by Depth Analysis, which does not seem to exist any more). You have surely already seen it but here is the tech trailer: LA Noire – Tech Trailer
  • Dimensional Imaging (www.di4d.com) provides tools and processing if you want to go this route. They also generously provide sample data: www.di4d.com/sample-data/. They have a long list of games, movies, and trailers which used their technology. But it is unclear how the data was actually used for each application.

Of course there are many reasons why this kind of data is hard to work with and/or prohibitively expensive, but that is not the point. There is something going on in that diffuse map that tricks our brain into accepting it. We need to understand why 4D data looks so good, and apply those learnings to our facial rigs. To me, the answer is simple: We need better animation in our diffuse maps. Unfortunately it is hard to conclusively know exactly what is happening because of baked-in lighting and UV sliding. For example, if the UVs are sliding, it means that the projection of the diffuse map is compensating for the inaccuracy of the geometry, and we should fix the problem with better geometry. But if the diffuse map is stable but changing color, it means that there are color changes that we are missing. Then again if those color changes are due to baked in lighting then the real problem might be the geometry or skin shading. Whatever it is, we need to understand it.

Conclusions.
So my main point is simple: We are missing something, but we should be able to figure this out. Using the current techniques for facial animation (bones + wrinkle maps + blendshapes) is not enough. Simply doing more of the same (more bones, more wrinkles, and more blendshapes) will not get us there. We need to do something different. And given the information that we have and the high quality capture solutions available to use, we should be able to figure out what that something is.

Most importantly, we can do this! I truly believe we can make realtime rigs that cross the Uncanny Valley. But it will take hard work, dedication, and a very thorough analysis of reference.

Images Licensing:
All Uncanny Valley images on this page are derived from the image provided by Masahiro Mori and Karl MacDorman at http://www.androidscience.com/theuncannyvalley/proceedings2005/uncannyvalley.html and are licensed under the GNU Free Documentation License, Version 1.2.

Recently, I was sitting in an office and I noticed some really cool light shafts that seemed to be dancing on the ceiling. Here’s a picture.

Looking closer, the tree outside is also casting a shadow.

Huh??? How is the sun casting a shadow on the ceiling? I was pretty confused for a few seconds there.

Then it became obvious: The shadows on the ceiling were the specular reflections from cars on the road. Here is a picture from that window looking down. The office is on the second floor.

In the photo we can see a specular highlight on the car. Based on the photo you would think that the specular highlight is about as bright as the sky since they are both overexposed. But in reality that little specular highlight is many times brighter than a point on the sky. I’m always amazed at how bright specular highlights can be.

Just to keep things straight, here is a diagram.

The path of light is:

  1. Light comes from the sun.
  2. Reflects off the car as a specular reflection.
  3. Gets shadowed by the tree outside.
  4. Goes through the window.
  5. Hits the ceiling.
  6. Turns into diffuse light to be seen by the viewer.

Let’s suppose that we wanted to simulate this scene in realtime on a console. We have a bunch of cars zipping by. Each car is causing a specular reflection. But cars are curved so the reflected specular light gets spread out. Then that highlight needs to be shadowed by the tree and window to hit the ceiling. How would we do that? I have no idea. Wait for the Playstation 8?

So yes, specular reflections can be very, very bright. To do GI properly, every single car is bright enough that it needs to be accounted for. Also, many other dielectric surfaces (especially smooth ones) become very specular at grazing angles. Cars aren’t the only objects that cause specular reflections.

In summary, accounting for specular in global illumination is hard. In other shocking news, water is wet.

Spherical harmonics rotation is one of those problems that you will occasionally run into as a graphics programmer. There has been some recent work, most notably Sparse Zonal Harmonic Factorization for Efficient SH Rotation (Project, PDF) which was presented at Siggraph 2012. In games we usually care about low order SH, especially 3rd order. According to Table 2, it takes 112 multiplications to rotate 3rd order SH by the zxzxz method, and 90 by the improved Zonal Harmonics method. I’ll show you a simple way to do it in 57, and source is provided at the end of the post.

As mentioned in the last post, one way that you can improve the lighting quality is with spherical harmonic ambient occlusion. In my case, I have a skinned mesh with baked spherical harmonics on each vertex. But as the mesh moves the spherical harmonics have to move with it. And since the mesh rotates the SH vector has to rotate too.

For this post, I’m assuming that you understand the basics of spherical harmonics. If you don’t, the best place to start is with Peter-Pike Sloan’s excellent presentation Stupid Spherical Harmonics Tricks (Slides, PDF). It contains everything that you really need to know about spherical harmonics.

As a quick review, spherical harmonics are split into bands. Band 0 has 1 coefficient, band 1 has 3, and band 2 has 5. You can see visualizations of the bands in the image below (which is from the PDF). Also, when we say “Order N” we mean the first N bands. So “Order 3” means “The first 3 bands”, which includes band 0, 1, and 2.

The most important property of Spherical Harmonics is rotational invariance. No matter what direction the light comes from the projected image looks the same. The image below (also taken from Stupid SH Tricks) compares SH to the Ambient Cube from Half-Life 2. With the Ambient Cube a light coming directly along the X/Y/Z axis is much brighter than a light coming from an angle whereas it looks the same from any angle with Spherical Harmonics. The formulas for projecting a normal vector into spherical harmonics are in the appendix of Stupid SH Tricks.
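
For band 2, which is the band that matters for the trick later in this post, the projection of a unit direction looks like this (these are the standard real SH basis constants in the usual m = -2..2 ordering; double-check them against the appendix if your convention differs):

// Evaluate the 5 band-2 SH basis functions for a normalized direction (x, y, z).
// Constants are sqrt(15/(4*pi)), sqrt(5/(16*pi)), and sqrt(15/(16*pi)).
void ProjectBand2(float x, float y, float z, float out[5])
{
    out[0] = 1.092548f * x * y;
    out[1] = 1.092548f * y * z;
    out[2] = 0.315392f * (3.0f * z * z - 1.0f);
    out[3] = 1.092548f * x * z;
    out[4] = 0.546274f * (x * x - y * y);
}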

Each of the bands is independent. If we want to rotate 3rd order SH then we need to rotate bands 0, 1, and 2 separately.

Finally, each band can be rotated by a linear transformation. In other words, band N will have 2N+1 coefficients, and we can rotate that band with a square matrix of size 2N+1.

In summary, here are the important properties for rotating spherical harmonics:

  1. A light direction vector can be projected into spherical harmonics with a simple, closed form solution.
  2. A direction projected into spherical harmonics looks the same regardless of which direction it comes from.
  3. We can rotate spherical harmonics with a linear transformation.
  4. Each band is rotated independently.
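
To make property 1 concrete, here is a minimal sketch of the closed-form projection of a unit direction into all 9 Order 3 coefficients. The names (Vec3, Sh9, projectOrder3) are my own, not from the source at the end of the post, and the constants are the common real SH values without the Condon-Shortley sign flips; sign and ordering conventions vary between references, so they need to match whatever the rest of your pipeline uses.

    // Minimal sketch: closed-form projection of a unit direction into
    // Order 3 SH (band 0 + band 1 + band 2 = 9 coefficients).
    // Names and layout are my own; constants are the usual real SH values
    // without the Condon-Shortley sign flips.
    #include <array>

    struct Vec3 { float x, y, z; };
    using Sh9 = std::array<float, 9>;

    Sh9 projectOrder3(const Vec3& n) // n must be normalized
    {
        const float k0 = 0.282095f; // band 0: 0.5*sqrt(1/pi)
        const float k1 = 0.488603f; // band 1: sqrt(3/(4*pi))
        const float k2 = 1.092548f; // band 2: 0.5*sqrt(15/pi)
        const float k3 = 0.315392f; // band 2: 0.25*sqrt(5/pi)
        const float k4 = 0.546274f; // band 2: 0.25*sqrt(15/pi)
        return {
            k0,                            // band 0: 1 coefficient
            k1 * n.y, k1 * n.z, k1 * n.x,  // band 1: 3 coefficients
            k2 * n.x * n.y,                // band 2: 5 coefficients
            k2 * n.y * n.z,
            k3 * (3.0f * n.z * n.z - 1.0f),
            k2 * n.x * n.z,
            k4 * (n.x * n.x - n.y * n.y)
        };
    }

Property 4 shows up directly in this layout: a rotation only mixes coefficients within the same band, so rotating the full vector means leaving index 0 alone, rotating indices 1 through 3 with a 3×3 matrix, and rotating indices 4 through 8 with a 5×5 matrix.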

Getting to the point, if we have an SH vector and a 3×3 rotation matrix M, how can we rotate the vector? There are many options to do it. We could:

  • Rotations around Z have a closed form solution in SH, and rotations by a fixed 90 degrees around X are known sparse matrices. So we could decompose our matrix into Euler angles and multiply by 5 sparse matrices. This is the zxzxz solution.
  • Use a Taylor series to approximate the rotation function (as in some PRT work). This option has problems with large angles.
  • Recent work (mentioned above) involves factorizing into Zonal Harmonics.

As mentioned before, for 3rd Order the sparse matrix zxzxz solution requires 112 multiplications, the sparse ZH solution requires 90, but we can do it in 57. So what’s the trick?

First, for band 0 we don’t have to do anything because band 0 is just a constant and has no direction. Band 1 is a simple matrix multiplication. So we’ll focus on band 2, but in theory this approach should work for any band. The trick is that rotation followed by projection is the same as projection followed by rotation.

Let’s define a few things.

  • x: our SH band 2 that we want to rotate. It has 5 components.
  • P: A function which projects a normal vector into band 2. So it takes a 3 component normalized vector as input and outputs a 5 component SH vector.
  • M: Our 3×3 rotation matrix. It’s the rotation that we want to somehow apply to our SH vector.
  • R: The 5×5 (unknown) rotation matrix that we want to apply to x.
  • N: Some 3D normalized vector.

As mentioned before, if we rotate a vector and then project it into SH, we get the same result as projecting it into SH first and then rotating it. We can describe this algebraically as:

R * P(N) = P(M * N)

When you think about it this way, solving for R is easy. We can do this same operation for 5 vectors and solve for R.

R * [P(N0), ..., P(N4)] = [P(M*N0), ..., P(M*N4)]

We can clean this up a little bit by defining a matrix A for the projected vectors on the left side, as in:

A = [P(N0), ..., P(N4)]

So this becomes:

R * A = [P(M*N0), ..., P(M*N4)]

And as long as we choose our normal vectors so that A is invertible, we can solve this directly, which turns into:

R = [P(M*N0), ..., P(M*N4)] * A^-1

That’s it. Since our normal vectors are chosen once, we can precalculate invA as A^-1. Also, we don’t actually want to calculate R. Rather, we want to multiply our SH vector x by it. The final formula is:

R * x = [P(M*N0), ..., P(M*N4)] * invA * x

The final algorithm to rotate our SH vector x by the 3×3 rotation matrix M is:

  1. Multiply x by invA
  2. Rotate our 5 pre-chosen normal vectors by M
  3. Project those rotated vectors into SH, which creates a dense 5×5 matrix.
  4. Multiply the result of invA*x by the new dense matrix.
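
Here is a rough, unoptimized sketch of those four steps for band 2. All of the names (Vec3, Sh2, Mat5, projectBand2, rotateBand2) are mine rather than the actual source at the end of the post, and invA is assumed to have been precomputed offline by inverting [P(N0), ..., P(N4)] with any matrix library (the included source uses JAMA for that step).

    // Rough sketch of the band 2 rotation, following the four steps above.
    // The names and layout are illustrative, not the post's actual source.
    #include <array>

    struct Vec3 { float x, y, z; };
    using Sh2  = std::array<float, 5>;   // band 2 coefficients
    using Mat3 = std::array<float, 9>;   // row-major 3x3 rotation
    using Mat5 = std::array<float, 25>;  // row-major 5x5 matrix

    // Same band 2 basis as in the earlier projection sketch (n normalized).
    Sh2 projectBand2(const Vec3& n)
    {
        const float k2 = 1.092548f, k3 = 0.315392f, k4 = 0.546274f;
        return { k2 * n.x * n.y,
                 k2 * n.y * n.z,
                 k3 * (3.0f * n.z * n.z - 1.0f),
                 k2 * n.x * n.z,
                 k4 * (n.x * n.x - n.y * n.y) };
    }

    Vec3 rotateVec(const Mat3& m, const Vec3& v)
    {
        return { m[0]*v.x + m[1]*v.y + m[2]*v.z,
                 m[3]*v.x + m[4]*v.y + m[5]*v.z,
                 m[6]*v.x + m[7]*v.y + m[8]*v.z };
    }

    // Rotate band 2 coefficients x by the 3x3 rotation M, given the 5
    // pre-chosen normals N and the precomputed invA (inverse of A).
    Sh2 rotateBand2(const Sh2& x, const Mat3& M,
                    const std::array<Vec3, 5>& N, const Mat5& invA)
    {
        // Step 1: y = invA * x
        Sh2 y{};
        for (int r = 0; r < 5; ++r)
            for (int c = 0; c < 5; ++c)
                y[r] += invA[r * 5 + c] * x[c];

        // Steps 2 and 3: rotate the pre-chosen normals and project them,
        // giving the columns of the dense matrix [P(M*N0), ..., P(M*N4)].
        std::array<Sh2, 5> cols;
        for (int i = 0; i < 5; ++i)
            cols[i] = projectBand2(rotateVec(M, N[i]));

        // Step 4: result = [P(M*N0), ..., P(M*N4)] * (invA * x)
        Sh2 result{};
        for (int c = 0; c < 5; ++c)
            for (int r = 0; r < 5; ++r)
                result[r] += cols[c][r] * y[c];
        return result;
    }

Band 1 is just its own 3×3 linear transformation and band 0 passes through untouched, so the interesting savings all come from how the 5 normals are chosen, which is what the rest of this post is about.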

If you look at the Zonal Harmonics paper you will see that this algorithm is almost identical. But the advantage here is that we can choose vectors that give us sparser data. The Zonal Harmonics paper is restricted to finding, you know, Zonal Harmonics. We can just choose 5 vectors out of thin air and it works as long as the projections of those 5 vectors are linearly independent. So we’ll choose these vectors:

And here is what our invA looks like. Inverting a sparse matrix will not necessarily preserve sparsity but in this case it does.

But the really nice thing is that most of these terms end up cancelling out. We can divide the whole matrix by k0 and multiply it back in at the end, which turns most elements of the matrix into ones. With that, the sparse matrix calculation in the optimized version requires just one multiplication.

We have to rotate our 5 normal vectors by our 3×3 matrix. In the general case, a 3×3 matrix multiplied with a 3 component vector is 9 multiplies and 6 adds. But two of our vectors don’t need any operations and the 1/sqrt(2) cancels out. So multiplying all 5 normal vectors by M is just 9 adds.

During the projection step the constants in front of each term cancel out too. In the general case projecting into band 2 requires 14 multiplications, but due to the cancelling of terms we can skip the 5 multiplications by constants. Projecting and accumulating each vector turns out to require 9 multiplications.

Then we have a few multiplications at the end and we are finished. All in, rotating band 2 requires 48 multiplications, band 1 requires 9, and band 0 is free. So that’s 57 multiplications for a full 3rd Order rotation.

The actual performance gained is heavily dependent on the architecture. On hardware that has a fused multiply-add, some of these multiplications would have paired with an add anyway, so nothing is gained there. But the algorithm probably has fewer total instructions regardless.

Finally, here is the source code. Please try it out and let me know if there are any issues. I included the JAMA library for the inverse. I haven’t tested with larger orders but the same approach should work. Although for really high SH bands the zxzxz approach probably wins out since the dense matrix multiplication would start to dominate.

Source:
ShRotation.zip

As mentioned in the last post, many materials out there need some kind of custom shading model to look correct. If you wanted to do all those materials with a fat G buffer, your G buffer would quickly become way too fat. So to get that kind of shader generality on today’s GPUs you probably need forward shading.

But another way to greatly help shading in today’s games is with better kinds of occlusion. In games it’s pretty standard to have some kind of ambient occlusion affecting the characters. And there are many variations of occlusion that we can do to easily prevent light leaking and make objects seem like they belong in the scene.

1. Spherical Harmonic AO

One technique that I see huge potential in is Spherical Harmonic Ambient Occlusion. I mentioned this at my GDC talk. For each vertex you calculate spherical harmonics offline, and when you render you project the light into spherical harmonics. With this technique you can figure out which directions light is not allowed to come from. The image on the left is using SHAO and the image on the right is not, which is why you see light leaking on the eyes, ears, etc.
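
As a hedged sketch of one way to apply it (my own simplified take, not necessarily what the demo does): evaluate the baked per-vertex SH visibility in each light’s direction and use the result to attenuate that light.

    // Hedged sketch: per-vertex SH visibility used as directional occlusion.
    // shVis holds the 9 baked coefficients for this vertex, L is the unit
    // direction toward the light. Names and layout are illustrative only.
    #include <algorithm>
    #include <array>

    struct Vec3 { float x, y, z; };
    using Sh9 = std::array<float, 9>;

    // Order 3 basis in direction n (same constant/ordering caveats as the
    // projection sketch earlier in the post).
    Sh9 shBasis(const Vec3& n)
    {
        return { 0.282095f,
                 0.488603f * n.y, 0.488603f * n.z, 0.488603f * n.x,
                 1.092548f * n.x * n.y, 1.092548f * n.y * n.z,
                 0.315392f * (3.0f * n.z * n.z - 1.0f),
                 1.092548f * n.x * n.z,
                 0.546274f * (n.x * n.x - n.y * n.y) };
    }

    // Returns roughly how much light is allowed in from direction L.
    float shOcclusion(const Sh9& shVis, const Vec3& L)
    {
        const Sh9 b = shBasis(L);
        float v = 0.0f;
        for (int i = 0; i < 9; ++i)
            v += shVis[i] * b[i];
        return std::clamp(v, 0.0f, 1.0f); // multiply the light by this
    }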

To use this technique you need 9 channels. There is no way you can include 9 extra channels in your G buffer. The technique is too expensive to do everywhere, but for close-ups of important objects it makes a big difference at minimal cost.

Also, the cost could be considered negative. It’s common to need many shadowed lights when lighting characters and the cost of those shadows adds up fast. With SHAO you would still want shadows on your key lights but SHAO lets you skip the shadow on your fill lights. Which is more expensive: An optimized deferred-only renderer with lots of shadows or a forward plus renderer with fewer shadows? As we all know, it depends.

2. Diffuse/Specular AO

For the same demo there was a baked, animated AO term that moved with the eyes. One thing I did not mention was that there is a different AO function for the diffuse and specular term.

Diffuse and specular light act very differently under occlusion. Diffuse light leaves the surface in all directions, but specular light is much more likely to leave at grazing angles (due to Fresnel). So when we are in a cavity, the specular term should be affected much more strongly by AO than the diffuse term. A small increase in the amount of occlusion causes a much stronger effect on specular than on diffuse light. Naturally, this means that you often want a stronger/different AO term for specular than for diffuse.
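
As a purely illustrative sketch of that idea (not a formula from the demo), one cheap option is to drive both terms from the same baked AO value but remap it with a steeper curve for specular:

    // Illustrative only: same baked AO, but specular gets a steeper curve
    // so a small amount of occlusion darkens specular much faster.
    #include <cmath>

    float diffuseAO(float ao)  { return ao; }
    float specularAO(float ao) { return std::pow(ao, 4.0f); } // exponent is a guess to tune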

3. Analytic AO

In some situations you have a simple analytic formula for the occlusion affecting your surface based on the light direction. One of the more annoying artifacts is lights from behind the head lighting up the teeth and tongue. In fact, this problem is more common this generation than last because proper physically based specular is much brighter at grazing angles and causes more artifacts than the bad, non-physically based lighting of generations past.

One stupidly simple technique to minimize this problem is to say that lighting inside the mouth can only come from the front. Let’s say you know the forward vector for the head, which we will call F. Then you can modulate the light brightness by saturate(dot(L,F)). It’s very cheap, very simple, and it helps a great deal. But it’s very difficult to do this with a G buffer.
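
Here is what that trick looks like as a small C++ sketch; F and L are assumed to be normalized, and the function and struct names are mine:

    // Sketch of the mouth trick above: F is the head's forward vector,
    // L is the direction toward the light, both normalized. Lights from
    // behind the head contribute nothing inside the mouth.
    #include <algorithm>

    struct Vec3 { float x, y, z; };

    float mouthLightScale(const Vec3& L, const Vec3& F)
    {
        const float d = L.x * F.x + L.y * F.y + L.z * F.z; // dot(L, F)
        return std::clamp(d, 0.0f, 1.0f);                  // saturate(dot(L, F))
    }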

4. Polynomial Texture Maps

One of the graphics techniques that games never adopted and could use another look is Polynomial Texture Maps. The key idea is to store a quadratic polynomial of the lighting function at every point on the texture.
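
For reference, the “quadratic polynomial” in a standard PTM is a six-coefficient biquadratic in the projected light direction, so the per-texel evaluation is a small sketch like the following (function and variable names are mine; lu and lv are the light direction projected onto the texture’s tangent plane):

    // Sketch of the standard biquadratic PTM evaluation at one texel.
    // a[0..5] are the six baked per-texel coefficients.
    float evaluatePtm(const float a[6], float lu, float lv)
    {
        return a[0] * lu * lu
             + a[1] * lv * lv
             + a[2] * lu * lv
             + a[3] * lu
             + a[4] * lv
             + a[5];
    }

Those six coefficients per texel are also exactly where the memory problem discussed below comes from.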

This results in some really cool lighting effects. One of the main flaws of game lighting models is that local details don’t interact at all, but with PTMs each pixel knows which directions light can come from (or at least an approximation of that). It gives you the effect of each of the little bumps on a surface self-shadowing the other bumps.

There are several issues, of course. The first problem is authoring. PTMs seem to be mainly used for imaging ancient artifacts. It’s much easier to scan a PTM than to physically send a 1000-year-old vase found in Africa to an expert in South America. But if you wanted to scan a game texture you would need authoring tools (like tiling support), which is a major undertaking.

Another problem is memory. Instead of a single diffuse map you need 6 diffuse maps to properly recreate the PTM. The lighting function is cheap once you read those maps but memory is always tight so 6 maps is a bit much.

I think the interesting variation is only storing baked occlusion. If you have a high-res z-brush sculpt then you could bake a PTM of the occlusion so that all of the little bumps and crevices would self-shadow. That would cost you six channels, which would fit into two BC7 maps. I believe Turtle (from Illuminate Labs, before they were purchased by Autodesk) still supports these and it is included in Maya 2014 and later.

5. New Research?

Finally, I’m sure there are new techniques out there to be discovered. I’ve already talked with several people doing interesting things that are improvements on these, but they aren’t released publicly so I can’t talk about them. The common thread is that there are a number of techniques for occlusion with the following properties:

  • Looks good.
  • Occludes based on light direction.
  • Has many parameters.
  • Is cheap enough to use in moderation, but too expensive to use everywhere.

For occlusion techniques fitting those four criteria you probably need forward shading, and I would expect to see more techniques like that in the future.