The Quest for Very Wide Outlines

An Exploration of GPU Silhouette Rendering

Ben Golus
Jul 18, 2020

On almost every project I’ve worked on, at some point someone comes out with a bit of graphic design concept art like this:

Disclaimer: Not Real Concept Art

My usual response is to sigh, and start explaining why we can’t do outlines like that. Or at least start talking to them about the kind of deep modifications to the asset pipeline we would need to make this possible.

Of course my latest project is no different. But this time, before I launched into my script on the differences between what Photoshop can do and the limitations of real time, I had a thought.

GPUs are stupidly fast now. Maybe I should try it again.

This is a tale about trying to make very wide outlines for real time rendering. Attempting to over-engineer and optimize the “bad” brute force way to see how fast I can get it. And then finally relenting and implementing the method I was willfully ignoring from the start.

Spoiler Alert: it’s the Jump Flood Algorithm.

Cracked Shell

One of the oldest and most used methods for adding outlines to any mesh is the inverted hull method. This has been used probably for as long as 3D rendering has been a thing. Just render the mesh twice, with the second version of it flipped inside out and slightly bigger. Jet Set Radio is an excellent early example of this technique. But it’s still used today to great effect, like any of the recent 3D fighting games from Arc System Works.

Jet Set Radio HD and Guilty Gear Xrd -SIGN- using inverted hull mesh outlines

The methods for achieving a slightly bigger mesh are varied, but the most common method is to move the vertices out by the vertex normal. For very old games this would have been done outside of the game and there would simply have been two meshes, but today it’s most commonly done with shaders.
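To make that concrete, here’s a minimal sketch of the vertex shader for such an outline pass in Unity-style HLSL. It assumes a second pass rendered inside out (Cull Front in ShaderLab) with a flat outline color, and names like _OutlineWidth are illustrative:

    // Inverted hull outline pass: push each vertex out along its normal.
    // Assumes this pass is rendered with "Cull Front" so only the expanded
    // back faces show around the original mesh.
    #include "UnityCG.cginc"

    float _OutlineWidth; // outline thickness in object space units

    struct appdata
    {
        float4 vertex : POSITION;
        float3 normal : NORMAL;
    };

    float4 vert(appdata v) : SV_POSITION
    {
        v.vertex.xyz += v.normal * _OutlineWidth;
        return UnityObjectToClipPos(v.vertex);
    }

Real implementations usually scale the push by distance, or do it in clip space, so the width stays constant in screen pixels. That’s where much of the added complexity comes from.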

The great thing with this style of mesh based outline is that it’s very cheap, relatively versatile, and can handle a very wide range of outline widths.

The problem with any mesh based outline system is that it requires some very intentional content setup. Without that you can easily get split edges, holes, and other artifacts. I’ve done this work before on several projects over the years. I knew I could fall back on it if we needed to, but I’ve never been completely pleased with the outcome, even with spending significant resources on the content pipeline and asset processing.

Holes and seams from hard mesh edges and inconsistent line widths

Even with the extra work it’s basically impossible to get consistent, pixel perfect outlines. It’s just not a solution that works on arbitrary meshes if you’re attempting to match the clean quality of a non-real time tool. Thus I started down this rabbit hole.

Eh Tu Brute?

What happens if I write a totally brute force outline shader today? How wide of an outline can I get before it becomes a noticeable impact on the frame rate, even for the ridiculously fast 2080 Super my computer is currently sporting?

So to find out, I wrote one. A really dumb, straightforward one.

10 pixel radius Brute Force Outline ~2.7 ms @ 1080p

Looks pretty good! In fact it basically matches a Photoshop outline exactly. And it’s nowhere near as slow as I feared.

Basic Brute Force

Here’s a basic rundown of what the code for this does. I render out a target mesh (or set of meshes) with a solid white shader to a full screen greyscale render texture. Then I render that back to the main target with a shader that naively samples all texels within some pixel radius and takes the max value.
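In shader terms, that second step is little more than a pair of nested loops. Here’s a simplified sketch of what that fragment shader might look like (identifiers are illustrative, and the distance based edge fade described later is omitted):

    // Naive brute force outline: test every texel in a square around the
    // current pixel and keep the max silhouette value found.
    Texture2D _MainTex; // the greyscale silhouette render texture
    SamplerState sampler_MainTex;
    float4 _MainTex_TexelSize; // xy = 1 / resolution
    int _PixelRadius;

    float frag(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
    {
        float maxValue = 0.0;
        [loop]
        for (int x = -_PixelRadius; x <= _PixelRadius; x++)
        {
            [loop]
            for (int y = -_PixelRadius; y <= _PixelRadius; y++)
            {
                float2 offsetUV = uv + float2(x, y) * _MainTex_TexelSize.xy;
                maxValue = max(maxValue, _MainTex.SampleLevel(sampler_MainTex, offsetUV, 0).r);
            }
        }
        return maxValue;
    }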

Is this expensive? Absolutely! But outlines I would consider quite wide by usual brute force outline shader standards weren’t expensive enough to be a noticeable impact on frame rate. At least not on my RTX 2080 Super with a 1920x1080 target, for a prototype that was already taking less than 4 ms to render everything else.

The graph is showing the millisecond cost of the whole effect, including rendering of the initial silhouette render texture, and the blit back onto the main frame buffer.

Going too wide gets expensive quickly though. This is because the shader is doing (2 * pixel radius + 1)² samples, and is doing this for every pixel on screen. The graph line follows pretty close to a typical O(n²) quadratic curve, which would make sense. At 20 pixels it’s nearly 10 ms, so I didn’t test going above that. At one point I accidentally set the radius to 80 and my GPU locked up.

Up to a 5–6 pixel radius was almost reasonable for the use case of one or two unique outlines at a time. But that’s not really a very wide outline. Certainly nowhere near the width of the outline in the concept art.

Optimization Pass 1: Skip The Mesh

An obvious solution is to just not do it for the full screen. But how can you limit what pixels the effect is applied to? The seemingly obvious answer is “just limit it to some distance from the mesh”. But that’s less easy than it sounds. Finding every pixel within some distance of the mesh is exactly the kind of search the outline shader itself is doing, and that’s the work I’m trying to avoid doing everywhere!

Luckily there’s at least an easy way to exclude the interior of the mesh. Render the original mesh with a stencil buffer only pass to mark the interior pixels. Then don’t render the outline shader there. This has the added benefit of making it composite accurately when MSAA is being used.
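In ShaderLab terms, the stencil-only pass can be as simple as this sketch; the outline blit’s own pass would then test against Ref 1 with Comp NotEqual to skip those pixels:

    // Stencil-only pass rendered with the original mesh: writes 1 to the
    // stencil buffer and nothing to color, marking the interior pixels.
    Pass
    {
        ColorMask 0
        ZWrite Off
        Stencil
        {
            Ref 1
            Comp Always
            Pass Replace
        }
    }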

Yellow showing interior stencil mask

But that’s not always a large part of the screen unless the camera is zoomed really close. This meant when the object being outlined was small or off screen it was more expensive than when it was large on screen. I needed some way to limit it more.

Optimization Pass 2: Exterior Mask

CPU side mesh bounds are often significantly larger than the actual mesh for various reasons. So while I could easily calculate the rectangular screen coverage for a mesh renderer’s bounds, most of the time that would be almost the full screen anyway. Expanding the mesh itself has the problems I mentioned above, so there’s no guarantee any vertex expansion actually covers all of the area the brute force outline would.

I could calculate my own rectangular screen space bounds from the vertex positions. But I want this to work with skinned meshes, and that complicates things significantly when using GPU skinning. The CPU doesn’t know where the vertices are, and there’s not a clean way to pass the vertex positions to a compute shader. I could also sample every texel in the silhouette render texture to find the min & max bounds. Neither of those techniques is super fast, and I’d still only have a rectangular area the outline would be limited to.

However I realized I could abuse mip mapping here to efficiently find the relative bounds of the outline, albeit in a much lower resolution form. So I generate the mip maps for the render texture, then sample the mip map level that represents the outline radius. That can be rendered to the stencil to create a mask of where you do want the outline rendered.
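The core of the idea in shader form might look something like this sketch, where the mip level selection is simplified down to a log2 of the radius (my assumption for illustration; the actual implementation picks one level lower and expands, as described below):

    // Exterior mask sketch: sample the silhouette's mip chain at a level
    // whose texels are roughly as wide as the outline radius. Any non-zero
    // value means the full resolution outline could reach this pixel.
    Texture2D _SilhouetteTex;
    SamplerState sampler_SilhouetteTex;
    float _OutlinePixelRadius;

    float frag(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
    {
        // each mip level doubles the texel footprint, so log2(radius)
        // picks a level that approximates the outline's reach
        float mipLevel = ceil(log2(max(_OutlinePixelRadius, 1.0)));
        float coverage = _SilhouetteTex.SampleLevel(sampler_SilhouetteTex, uv, mipLevel).r;
        clip(coverage - 0.0001); // discarded pixels don't write the stencil mask
        return 0.0;
    }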

Green showing outer range stencil mask

I actually choose a mip level that is one less than the outline width, then do a variable width one texel outline in the mask shader to expand it. This helps confine the coverage a little more than full mip level steps. Especially since the expansion ends up exaggerated compared to the actual outline due to the coarser resolution of the mip map.

We’re getting way faster now. This reduced the cost of the outline by a huge amount: frame time costs dropped by 50% for outlines over a 3 pixel radius, and by nearly 75% at 20 pixels. But there’s more we can do.

Optimization Pass 3: Line Interior Mask

While using the mip mapped render texture to find the approximate outline exterior, I accidentally messed up my math, sampled the wrong mip level, and realized it could also be used to calculate an approximate interior line area. This is a little more complicated than the exterior mask, since I need to limit it to a mip level that only covers the inside of the outline circle. So I also use a shader that does some extra texture samples to do a 4 sample outline of the mip map level to fill in more of the interior. This is similar to the exterior mask’s 1 texel outline expansion, but it has to fit the inside of the radius rather than the exterior, so it has to be much more conservative. I don’t do this pass for outlines under a certain width, as it’s not really any cheaper than the shader it’s replacing at that point.

Red showing inner line fill stencil mask

This only really helps outlines above the 10 pixel radius. For the low single digit outlines it increased their cost by up to 50%, so I turn it off for anything less than that.

At this point I’ve done just about all I can do to cheaply limit the area the brute force outline shader needs to run. But really I felt like my attention needed to turn to the outline shader itself.

Optimization Pass 4: Skip Samples Outside Radius

In the original shader I’m sampling a square area of texels and then multiplying each sample by the distance to the edge to get an anti-aliased circle. One of the first things I tried was to add a branch to entirely skip samples outside of the radius. Surprisingly this worked! It’s not a huge performance gain, maybe 10% faster, but it was something.
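In the brute force sketch from earlier, that change amounts to an early rejection at the top of the inner loop, something like:

    // Inside the nested loops: skip texels in the square's corners before
    // paying for the texture read, and fade the surviving samples by their
    // distance to the radius for an anti-aliased circular edge.
    float2 offset = float2(x, y);
    if (dot(offset, offset) > _PixelRadius * _PixelRadius)
        continue;

    float value = _MainTex.SampleLevel(sampler_MainTex, uv + offset * _MainTex_TexelSize.xy, 0).r;
    float edgeFade = saturate(_PixelRadius - length(offset) + 1.0);
    maxValue = max(maxValue, value * edgeFade);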

There were also a lot of failed attempts at optimization.

Failed Optimization Pass 1: Early Out

So I was already skipping any sample position inside the sample box but outside of the radius. I took the extra step of adding yet another branch to exit if the max sample was already equal to 1. When running without any stencil mask, this would make the overall effect a little slower, since any pixel not in range of the mesh is paying the cost of that extra branch for no benefit. That removed any possible benefits gained by those pixels within range being potentially faster.

But with the stencil optimizations already reducing the area the shader was running to mostly areas close to the mesh, that issue goes away! So I added it back in and … it ended up being no faster. Whatever benefits there were to be had from skipping extra samples didn’t make up for the cost of the branch.

My guess as to why this didn’t help overall was not really the cost of the branch itself, though that was part of it, but rather the high divergence it caused. The short version is since every pixel’s shader invocation is going to be stopping at a different point, the GPU couldn’t make as efficient use of this optimization. The GPU runs multiple pixel shaders in batches, and all invocations in that batch cost as much as the most expensive pixel. So while some pixels may have been potentially much cheaper, they were still waiting for the more expensive pixels to finish. The result just happens to even out.

This was at least not multiple times slower…

Failed Optimization Pass 2: Sampling Mip Levels

Before I did the interior line masking I tried a different tactic for the line interiors. I attempted to adjust the mip level I sampled from in the shader to use the smallest mip map needed. I’ve come across countless articles on using mip levels with screen space ambient occlusion to optimize performance. The idea for SSAO is the further away from the center the sample is, the smaller the mip level you sample from. This reduces memory bandwidth requirements and works great with little quality loss!

For the outline it needed to be the opposite. Sample the top mip at the edge of the outline and drop the mip level the further from the edge it was.

This was much, much slower. Several times slower. Why? Because I was totally thrashing the texture cache. The brute force method meant most of the texture reads were reusing the data already in the cache. Switching the mip level between samples invalidated the cache. It works for SSAO because it’s using sparse samples and isn’t an exhaustive search like this outline shader needed to be.

Failed Optimization Pass 3: Skip Samples Inside Radius

Since I was now rendering the interiors of the lines using the mip map approximation pass, I added a very basic test in the shader to skip any samples that I assumed would be guaranteed to be covered by that pass. This technically was faster, with similar or better gains than skipping the samples outside the radius. But like the inner fill stencil it was only faster at the wider radii, mainly in the ranges that were too slow to be useful. Also, since the inner fill stencil only ran above a 10 pixel radius, it meant either keeping two versions of the outline shader to switch between so the small radius version didn’t skip the interior samples, or making the existing branch more complex. The more complex branch made everything slower, losing most of the benefits of using it at all. And using variants … well, it wasn’t a big enough gain to be worth the hassle, so I shelved it.

Failed Optimization Pass 4: Single Loop Instead of Nested

The original brute force shader is two loops and a branch to early out on samples outside of the circle. I decided to look into ways of using a single loop and calculating the UV offset from a single loop index. The math for this is pretty simple, and common in things like flip book shaders for converting a time into an atlas position. Surprisingly, like sampling different mip maps, this was also much slower. The minor extra ALU cost ends up being about 3 times more expensive than just keeping the extra loop.
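For reference, the index math in question is only a couple of lines. Using the names from the earlier sketch:

    // Single loop variant: derive the 2D offset from one flat index, the
    // same way a flip book shader turns a frame index into an atlas cell.
    int width = 2 * _PixelRadius + 1;
    [loop]
    for (int i = 0; i < width * width; i++)
    {
        float2 offset = float2(i % width, i / width) - _PixelRadius;
        float2 offsetUV = uv + offset * _MainTex_TexelSize.xy;
        maxValue = max(maxValue, _MainTex.SampleLevel(sampler_MainTex, offsetUV, 0).r);
    }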

I also tried using a sample pattern like what’s used for SSAO or bokeh depth of field effects where a Poisson disc or relaxed / spherized square is used. This too was slower. Probably due to a combination of texture cache thrashing, over sampling, as well as the extra ALU cost. It also came with artifacts when using a slightly lower sample count to avoid over sampling.

I considered things like using a Morton Z order, but at this point I kind of gave up. Everything I tried ended up being more expensive than just sampling the full square line by line and skipping samples early.

Done?

At this point I was pretty pleased with how well this worked. I could get outlines of around 20 pixels in radius for under 1.7 ms! Less time than the rest of the post processing we were already doing. And fast enough for a prototype where only one or two objects needed outlines at a time.

It’s still roughly an O(n²) quadratic curve, but significantly flattened compared to the original un-optimized version.

30 pixel Optimized Brute Force Outline ~2.5 ms @ 1080p

Unfortunately, at wider outlines the mip map based mask starts to show its weakness, resulting in a far less consistent curve.

The difference between a 32 vs 33 pixel outline showing significant increase in the mask area as the mip level increases

So around a 32 pixel radius was the limit for this approach. A slightly more complex mask at the wider widths would probably end up being a win. But it was already a very complex bit of code with a lot of passes, so I didn’t want to spend too much more time on it. Instead I decided to compare it against the other approach I’ve used in the past. Remember, I knew the brute force method wasn’t necessarily the best approach in terms of performance, just quality. This was all an experiment to see if it could be made fast enough to be used instead of this other, more common approach.

Blurring the Line

A well known way to do outlines is, instead of doing an exhaustive search like the brute force approach, to do a separable Gaussian blur. The question was: is my heavily complicated, er, optimized brute force faster?

So, I set up a simple test and…

Blur Based Outlines

Not really, no.

30 pixel Gaussian Blur Outline ~1.1 ms @ 1080p

At 10 pixels or below, the optimized brute force approach was faster for most of the range, but not meaningfully so. Blurring is clearly faster for wider radii, though not by as much as I’d feared for the sub 20 pixel radius ranges I was initially playing with. The blur approach only became obviously better once you got into the over 15 pixel radius range. I’d expected an inflection point where the blur would end up faster, but I was expecting it in the small single digit radii, not over 10 pixels wide. Technically 1 pixel blur based outlines were faster, but the original brute force was faster than either at that size.

The main strength of the blur based approach is it’s O(n). The cost increases linearly with the radius!
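For those unfamiliar with the approach, here’s a sketch of the horizontal half of a separable blur over the silhouette texture; a matching pass repeats it vertically, and the blurred result is then remapped into an outline. The kernel here is computed in the shader, as mentioned below, and the names are illustrative:

    // One axis of a separable Gaussian blur of the silhouette. A second
    // pass swaps the offset to float2(0, y) for the vertical axis.
    Texture2D _MainTex;
    SamplerState sampler_MainTex;
    float4 _MainTex_TexelSize;
    int _PixelRadius;
    float _Sigma; // Gaussian standard deviation, e.g. radius / 2

    float frag(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
    {
        float sum = 0.0;
        float totalWeight = 0.0;
        [loop]
        for (int x = -_PixelRadius; x <= _PixelRadius; x++)
        {
            float fx = (float)x;
            float weight = exp(-(fx * fx) / (2.0 * _Sigma * _Sigma));
            float2 offsetUV = uv + float2(x, 0) * _MainTex_TexelSize.xy;
            sum += _MainTex.SampleLevel(sampler_MainTex, offsetUV, 0).r * weight;
            totalWeight += weight;
        }
        return sum / totalWeight;
    }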

And there’s still more optimization that could be had here. For a blur implementation there isn’t as easy a way to confine the area the blur runs on, so it was back to running full screen. That put it at a disadvantage to the optimized brute force. I tried some setups attempting to reuse the existing stencil buffer from the main buffer, but Unity doesn’t make this easy, so I never got that to work. I also tried creating the temporary render targets with a depth buffer and drawing the stencil mask to that. The extra cost of doing that completely removed any benefits and ended up making the blur even slower overall than the optimized brute force! Slower even than the un-optimized brute force for the first several pixel radii.

I could confine the area to the approximate rectangular screen bounds of the mesh. But an easier optimization is to down sample the image before blurring. This makes the technique much faster than either brute force approach past the first few pixels of radius. Depending on how much down sampling you wanted to do, you can roughly halve or better the cost of the wider outlines. I’m also calculating the blur kernel in the shader, and I could probably optimize the blur some more by calculating that in C# beforehand.

But I’m not going to do that. This was really meant as a test to see if I could get a brute force approach faster than a Gaussian blur outline implementation, not to see how well optimized I could get the latter.

Because this technique also has a few problems that aren’t easily solvable.

30 pixel Optimized Brute Force vs Gaussian Blur Outline

Thin, sharp, or small features can disappear or fade, the outline isn’t as precise, and proper anti-aliasing is harder. The first two are because it’s a blur. The last problem is because you have to use a separable Gaussian blur, which results in a non-linear falloff, and it’s a blur. Knowing how much to sharpen a variable width edge is difficult. Adding down sampling to improve performance makes all of the above problems worse. I knew about these before, which is why I’d avoided it to begin with.

Highlighting artifacts when using a Gaussian Blur based outline

If you want a fuzzy glow, it’s great! If you want an outline that’s wide and are okay with it being rounded, it’s great! Like I mentioned earlier, I’ve used this before and it’s fast and effective for those use cases.

If you want something that competes with Photoshop’s outline quality and works well on thin edges, it’s just nowhere near as good as brute force.

At this point I was still happy enough with the optimized brute force approach that I figured that was it. Worst case, I could swap to the Gaussian blur for wider lines, and back to brute force the rest of the time.

I posted a bit about my journey on Twitter, thinking I’d done all I could.

And then someone said: “What about JFA?”

Missing the Jump

Frell.

I’d known about the jump flood algorithm for a while, and had some experience using the results of someone else’s implementation. I’d not been impressed with the quality and didn’t think it’d work for my use case. Mainly because I didn’t think it’d work well with my self imposed requirement of handling an anti-aliased starting buffer.

The reality is I just didn’t understand how it worked well enough and was scared of it. Luckily, I decided to take the time to understand it. These links helped.

Jump Flood Algorithm

This is much faster for the wide outlines, and looks just as good as brute force.

30 pixel Jump Flood Algorithm Outline ~0.6 ms @ 1080p

Once I understood the technique better, that made sense. But this still felt like a fairly complicated technique with multiple expensive passes. Surely there must be some point where the optimized brute force and blur were faster, right?

Nope.

Ok, at 1 pixel the un-optimized brute force is faster than all of the options. But at every other radius the Jump Flood is straight up faster. You could maybe write a heavily hand optimized single pixel outline shader that’d be nearly half as expensive as the dynamic brute force shader is. But a one pixel outline isn’t the goal here.

The real magic of JFA is how inexpensive it is at really ridiculously wide outlines.

100 pixel JFA Outline ~0.7 ms @ 1080p

Yes, that’s a chart that goes past a 2000 pixel radius. Full screen 1080p outlines for less than 1 ms. Even full screen outlines at an 8k resolution could be just slightly over 1 ms, ignoring the extra costs from memory bandwidth increases (hint: it’ll be a lot more than 1 ms). Unlike the brute force methods, which are O(n²), or the blur which is O(n), jump flood is O(⌈log₂ n⌉). That means the cost of the outline is constant for each power of 2 pixel radius. Like the Gaussian blur, I’m not doing any limiting, apart from how many jump flood passes I do, so this is always working full screen, and it’s still this fast.

Now these are properly wide outlines.

With These Three Easy Steps…

I mentioned before I didn’t think JFA would work well with my self imposed requirement of supporting anti-aliasing. And to some degree that’s still an issue, but I found a way to support anti-aliasing about as well as brute force. It’s actually better than brute force at handling thin or sharp edges, and certainly better than the Gaussian blur.

Before I go into that I want to talk a little about JFA in more detail, and the main stages of it for producing the outline. The links I posted above do a good job of explaining the jump flood algorithm. If you’ve already read those, this might be retreading some info for you, but it’ll be important later.

The start for all of the outline techniques I tried is the same. I render out a greyscale mask of the meshes I want to have an outline. There are many possible ways to do this, but in my case I use a command buffer that renders each object with a solid white unlit material into a GraphicsFormat.R8_UNorm (aka RenderTextureFormat.R8) render texture. This is my starting mask I draw the outline from.

Mask Pass

Step 1: Collect Underpants

The first stage of JFA is the initialization shader pass. This takes the mask and outputs each valid texel’s screen space or UV position. For something like a black & white mask, any texels over some value threshold, like 0.5, output their position. Any values under the threshold output some other value that’s intended as a “no position” value, or some position well outside of the plausible range. In my case I output -1.0.
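A sketch of that init pass, assuming the silhouette mask from before (names illustrative):

    // JFA init: interior texels output their own UV position, everything
    // else outputs a sentinel "no position" value outside the 0-1 range.
    Texture2D _MaskTex;
    SamplerState sampler_MaskTex;

    float2 frag(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
    {
        float mask = _MaskTex.SampleLevel(sampler_MaskTex, uv, 0).r;
        return mask > 0.5 ? uv : float2(-1.0, -1.0);
    }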

Jump Flood Init Pass

Step 2: ???

The second stage of JFA is the actual jump flooding pass. This shader pass samples 9 texels in a grid and gets the distance from the invoking pixel to the position stored in each of those 9 texels, then outputs the value of the closest one. This pass is run multiple times between two render textures, with the spacing of the grid changing between each pass depending on the pixel range you need. For example, if I wanted a 14 pixel outline, I’d need 4 passes: the first run using an 8 pixel spacing, the second using a 4 pixel, the third using 2, and the fourth using 1. The end result is a render texture where each texel holds the value of the closest position output by the original init shader pass within the range of the largest grid sample spacing. That starting wide spacing is where the “jump” of the jump flood gets its name.
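A sketch of a single flood pass, ping-ponged between two render textures with _StepWidth halving each time. Positions here are kept in UV space for brevity; a real implementation needs to account for the aspect ratio or store pixel coordinates:

    // One jump flood pass: examine 9 texels spaced _StepWidth pixels apart
    // and keep whichever stored position is closest to this pixel.
    Texture2D _PositionTex;
    SamplerState sampler_PositionTex;
    float4 _PositionTex_TexelSize;
    float _StepWidth; // grid spacing in pixels for this pass

    float2 frag(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
    {
        float bestDist = 1e10;
        float2 bestPos = float2(-1.0, -1.0);
        for (int x = -1; x <= 1; x++)
        {
            for (int y = -1; y <= 1; y++)
            {
                float2 sampleUV = uv + float2(x, y) * _StepWidth * _PositionTex_TexelSize.xy;
                float2 storedPos = _PositionTex.SampleLevel(sampler_PositionTex, sampleUV, 0).rg;
                float dist = length(storedPos - uv);
                if (storedPos.x >= 0.0 && dist < bestDist)
                {
                    bestDist = dist;
                    bestPos = storedPos;
                }
            }
        }
        return bestPos;
    }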

This can be seen a little easier if the init pass only outputs the edge.

Init pass only outputting the mask edge
Jump Flood Passes for a 14 Pixel Radius
Final Jump Flood Pass Output for a 14 Pixel Radius, and 500 Pixel Radius

And yes, outputting only the edge in the init pass still produces correct results. And it could be used to do an interior, exterior, or centered line if so desired. Some of the links I posted above show examples of just that. But I don’t need that. I only want an exterior line. Plus the outline version of the init shader is slower.

Step 3: Profit!

The third and final pass takes the output of the second stage and gets the distance from the current pixel to the stored closest position. From that I can just use the target outline width to figure out the color that pixel should be. I.E.: if it’s less than the outline width, output the outline color, otherwise output a transparent value.
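A sketch of that final pass, with the outline width pre-converted to UV units for brevity:

    // Final pass: compare the distance to the stored closest position
    // against the outline width and output the outline color or nothing.
    Texture2D _PositionTex;
    SamplerState sampler_PositionTex;
    float4 _OutlineColor;
    float _OutlineWidthUV; // outline width converted to UV units

    float4 frag(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
    {
        float2 closestPos = _PositionTex.SampleLevel(sampler_PositionTex, uv, 0).rg;
        if (closestPos.x < 0.0)
            return float4(0, 0, 0, 0); // no position found in range

        float dist = length(closestPos - uv);
        return dist <= _OutlineWidthUV ? _OutlineColor : float4(0, 0, 0, 0);
    }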

10 pixel basic JFA outline

But just doing the basic jump flood outline ends up a bit jagged, with no anti-aliasing.

Anti-Aliasing

So how do I support anti-aliasing? Well, for one the third pass doesn’t use a hard on / off for the outline color output; instead I let it use the distance to fade the edge by 1 pixel. That helps, but it’s not the biggest problem. The init shader pass I mentioned above is effectively aliasing the output of the mask, so even if I render that render texture with anti-aliasing enabled, it’ll still end up having a jagged edge since that’s all the init pass is outputting.

Anti-aliasing from fading distance to aliased Threshold Jump Flood Init

In the above example you can see some minor anti-aliasing in the step corners. That’s the anti-aliased distance fade. But that doesn’t help the edges that are almost straight with the screen vertical or horizontal. There the stair stepping is still quite obvious.

I realized I could modify the init to make a guess at where the closest sub-pixel edge was based on the anti-aliased color. The basics of this: for any mask texel that’s fully black, I output the “no position” value. For any mask texel that’s fully white, I output that pixel’s position. The values that aren’t fully black or white are the interesting ones. For those I check the texel values immediately around the current one and compare them to find the average direction.

For sub pixel position estimation, let’s just think about this on one axis to start. One thing to be mindful of here is the position being output by a basic init pass is not actually the geometry edge, but rather half a pixel inside the geometry edge.

This isn’t a problem though. It actually has some benefits we can come back to later. If I have an anti-aliased mask texel with a value of 0.5, it means the geometry was covering approximately half of that pixel. But from that single texel alone you don’t know which half. But we can make a good guess. By sampling the left and right texels we can estimate which direction the closest edge is in just by subtracting the right texel’s value from the left. If the right texel is 1 and left is 0, then I know the geometry is covering the right side. And we can adjust the half pixel inset position to be half a texel in that direction.

If the reverse is true, then it’s covering the left side. If both the left and right values are the same, then the best guess we can make is that it’s centered on the pixel. I then do the same sampling with the texels above and below. This gives me a directional offset to approximate the subpixel position, which I then add to the current pixel position and output in the init pass.

Jump Flood Init approximating sub pixel edge position using left/right & above/below texel samples
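Putting those pieces together, here’s a sketch of this axis-only version of the init pass. The diagonal samples discussed next are omitted, and the thresholds for “fully black” and “fully white” are illustrative:

    // Anti-aliased JFA init: partially covered texels estimate a sub-pixel
    // offset toward the geometry from the surrounding mask values.
    Texture2D _MaskTex;
    SamplerState sampler_MaskTex;
    float4 _MaskTex_TexelSize;

    float2 frag(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
    {
        float center = _MaskTex.SampleLevel(sampler_MaskTex, uv, 0).r;
        if (center < 1.0 / 255.0) return float2(-1.0, -1.0); // fully outside
        if (center > 254.0 / 255.0) return uv;                // fully inside

        // partially covered: compare opposing neighbors to find which
        // direction the geometry lies in
        float2 dx = float2(_MaskTex_TexelSize.x, 0.0);
        float2 dy = float2(0.0, _MaskTex_TexelSize.y);
        float left  = _MaskTex.SampleLevel(sampler_MaskTex, uv - dx, 0).r;
        float right = _MaskTex.SampleLevel(sampler_MaskTex, uv + dx, 0).r;
        float down  = _MaskTex.SampleLevel(sampler_MaskTex, uv - dy, 0).r;
        float up    = _MaskTex.SampleLevel(sampler_MaskTex, uv + dy, 0).r;

        float2 dir = float2(right - left, up - down);
        if (dot(dir, dir) > 0.0)
            dir = normalize(dir) * 0.5; // nudge half a texel toward the edge

        return uv + dir * _MaskTex_TexelSize.xy;
    }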

This looks quite good compared to the original brute force version. However just doing the horizontal and vertical axis meant some obvious artifacts in sharp corners where a single pixel could show up. Below you can see the corner has a “bubbled” look to it compared to the brute force approach. Though that too isn’t quite right and is a little too soft and rounded.

“bubble” artifact on sharp anti-aliased corners

This is because on single pixel corners the estimation basically thinks this is a pixel floating by itself and doesn’t do any (or enough) offsetting. So I sample the diagonals as well and add those with a slightly reduced weight, then normalize the resulting direction vector. If some of you are reading that and thinking “that sounds like a Sobel operator”: yep.

correct outline on sharp anti-aliased corners

I had tried a couple of different weights on the diagonal and ended up recreating Sobel by accident.

This ends up being very close to the original brute force approach. And because of the sub pixel estimation and the fact it doesn’t fade out on anti-aliased corners, may actually be closer to the ground truth.

Earlier I mentioned that the init pass outputs a position that’s half a pixel inset from the real edge, and that this is actually an advantage. The reason for that comment: imagine a line that’s only a single pixel wide, or narrower.

In the above example, the estimated edge doesn’t line up with the real geometry. But we can only store a single position per texel, and we can only see that single texel line. If we were attempting to store the actual edge position, in the above case there are two edges and we’d have to pick one. The result would be the nearest distance wouldn’t be centered and the outline would be wider on one side than the other. For wide edges, that probably wouldn’t be obvious. But a 1 pixel outline? That would be obvious as only one side of the line would get an outline. This is why storing a half-pixel inset position is beneficial. We can still draw a correct 1 pixel outline on a 1 pixel wide shape by using a 1.5 pixel distance. This is something even the brute force method has a hard time with.

1 pixel Jump Flood Outline on geometry less than a pixel wide (500% Zoom)

It’s not perfect. In the above image, if you look closely you can see the outline fades out slightly where it shouldn’t, a little down from the top where it transitions from 25% coverage to 50% coverage. Doing a little more work in the init could probably fix this, but this is a rare enough scenario that I’m happy leaving it as is. I mean, it’s pretty darn good as is, and is only obvious when zoomed in this far.

In comparison the brute force method just fades out entirely.

1 pixel Brute Force Outline on geometry less than a pixel wide (500% Zoom)

That’s clearly wrong, even when not zoomed in.

Optimizations

This is already incredibly fast, but it didn’t stop me from at least attempting some further optimizations.

“Optimizations?”

I found only doing the diagonals and not doing a full Sobel gave very nearly identical results in most situations. And it was faster! It produced a compiled shader that had roughly half as many instructions! Nice optimization, right?

Well, not really. It was technically faster, but trivially so. It was faster by about a single microsecond. That’s one millionth of a second faster, less than the margin of error in profiling. Less than half a percent of the effect’s cost for a single pixel radius outline. It also caused the lines to get slightly too wide on diagonal edges. So while it was faster, I still do the whole Sobel operation as it doesn’t meaningfully impact the performance. The init shader pass was also already the least expensive of the three, so this was more an attempt to squeeze water from a stone.

The jump flood passes take up most of the time, so that’d be a better place to look to optimize. And I did think of a pretty simple one here. The basic idea is you’re looking for the closest of the 9 samples by comparing the distance from the current pixel. Basically all examples you’ll find, including those I link above, use a length() call and compare the results. But you don’t need the linear distance. You can get away with the square distance, which is cheaper to compute with a dot product. That removes 9 square roots from the shader, so that should be a good savings.
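In the flood pass sketch from earlier, that swap looks like this:

    // Compare squared distances instead of lengths: the ordering of the 9
    // candidates is preserved, and 9 square roots per pixel disappear.
    float2 delta = storedPos - uv;
    float distSq = dot(delta, delta); // was: float dist = length(storedPos - uv);
    if (storedPos.x >= 0.0 && distSq < bestDist)
    {
        bestDist = distSq;
        bestPos = storedPos;
    }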

Again, nope. I could not measure a difference here at all, not even a single microsecond. More or less any kind of clever shader change I attempted here resulted in no change, or made things slower. Both of these shader passes are completely hardware texture unit limited. I still use this optimization though, because it does no harm.

Render Texture Format

The biggest savings I got was to use a GraphicsFormat.R16G16_SFloat (aka RenderTextureFormat.RGHalf) render texture instead of a GraphicsFormat.R32G32_SFloat (aka RenderTextureFormat.RGFloat). That cut the memory requirements in half while still having more than enough precision. That alone dropped the cost of the init pass from 47 μs to 28 μs, and the jump flood pass from 75 μs to 67 μs. Not a lot, but for a 50 pixel radius outline the entire effect dropped from 630 μs to 565 μs. That’s at least a measurable improvement of around 65 μs, even if they’re both effectively 0.6 ms.

However, R16G16_SFloat had some weird precision issues, which caused the outline to be slightly offset in the subpixel position, even when not using the subpixel estimation. So I swapped to R16G16_SNorm instead, which removed the issues while still retaining the same memory footprint. This is ever so slightly slower than the R16G16_SFloat format. R16G16_UNorm also works, and is technically twice the precision for the 0.0 to 1.0 range I was using, but it requires a small amount of extra math to scale the encoded range, which added about 8 μs to the same 50 pixel radius test. And you could use R16G16_SNorm with a similar bit of extra math to get the same precision at the same relative cost increase. For 1080p, I didn’t think it was necessary.

Separable Axis JFA

The next optimization came in the form of an idea proposed by Alan Wolfe (aka demofox), whose article I linked above.

This splits each flood pass into two separate passes, each only doing one axis at a time. This nearly doubles the total number of passes required for the effect, but reduces the number of texture samples by a third. Amazingly, it is faster! By about 35 μs total, bringing that 565 μs down to 530 μs. Honestly, I have no idea if it’ll be faster on all GPUs though.
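A sketch of the horizontal half of a separated flood pass; the vertical half is identical with the offset swapped to the y axis, and both run at the same step width before it halves:

    // One axis of the separable jump flood: 3 samples along x instead of 9
    // in a grid. Declarations match the earlier flood pass sketch.
    Texture2D _PositionTex;
    SamplerState sampler_PositionTex;
    float4 _PositionTex_TexelSize;
    float _StepWidth;

    float2 frag(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
    {
        float bestDist = 1e10;
        float2 bestPos = float2(-1.0, -1.0);
        for (int x = -1; x <= 1; x++)
        {
            float2 sampleUV = uv + float2(x, 0) * _StepWidth * _PositionTex_TexelSize.xy;
            float2 storedPos = _PositionTex.SampleLevel(sampler_PositionTex, sampleUV, 0).rg;
            float2 delta = storedPos - uv;
            float distSq = dot(delta, delta);
            if (storedPos.x >= 0.0 && distSq < bestDist)
            {
                bestDist = distSq;
                bestPos = storedPos;
            }
        }
        return bestPos;
    }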

Downsample

There is one more “easy” optimization. Like the Gaussian blur approach this technique lends itself well to relatively easy downsampling. For very wide outlines or higher resolutions it wouldn’t be too difficult to render the starting mask at a lower resolution, or downsample it for better quality, and run the JFA at that lower resolution. Especially with the added sub pixel estimation in the initialize pass. I did actually try this and it can really significantly improve performance as you would expect.

But also like the Gaussian blur it means some of the details get smoothed out. There’s also a gotcha with the final outline pass. You can’t just use bilinear sampling on the output from the jump flood passes, as it’s interpolating an offset position. This produces very weird artifacts in any interior corners. So instead you need to add one more pass to decode the distance field to a texture for the final outline pass to sample from. And even then you may want to add some kind of higher quality bicubic or Catmull-Rom filtering to hide the artifacts of bilinear filtering. I didn’t implement this, but it might be a good option for supporting very high resolutions more easily.

Compute

A compute shader would likely be faster at this than the render texture approach I’m currently using. But that’s a task for another day.

Final Words

So, obviously I failed at making a brute force outline that could compete with a more efficient approach. But I hope you enjoyed reading about my trip as much as I did taking it. And now we have a much more efficient, and more adaptable, outline technique. Mainly because the jump flood based approach can do so much more. For example, you could use a gradient instead of a simple edge. Even an animated gradient if you really want to be fancy.

You could also do interesting things like render out a depth texture and composite the outline into the scene with the depth of the closest edge! But I’ll leave you to play with that.

Bye!

Additional Thoughts

After posting this I got a couple of questions and comments on the article I want to address.

Geometry Shaders

I’ve had a couple of people ask about, or even show off, their geometry shader options. I’m not a fan of these for various reasons. There are several very convincing academic articles on using geometry shaders to do very high quality outlines, but they all require adjacency data. That’s literally having each triangle know the vertices of its adjacent triangles. No real time engine I’m aware of actually supports this. And even if it did, it still requires the mesh to have no seams. The easier and still effective method is to generate outline geometry on every single triangle of the mesh, but this requires either some amount of geometry processing similar to the vertex push method, and/or an extreme amount of overdraw.

In the end they end up being way worse for performance than something like the vertex push method, with only minor improvements. It’d be interesting to compare them at some point as it is possible to generate accurate distance fields on arbitrary geometry with a version of this technique.

An Outline of an Outline

This is a great suggestion. This would work too, and would allow the brute force method to effectively match the Gaussian blur approach in terms of having a linear outline width to cost. Maybe even beat it! It would be a fun thing to try. You probably don’t need to re-create the mip chain every time, but you would have to clear and redo the stencil for every pass. I’d also be concerned any minor errors would be compounded.

Still won’t be anywhere close to JFA though.

Outlined Sprites

One nice advantage all of the image based outline techniques have over geometry based approaches is they can work on objects that have their edge set by a texture rather than geometry. Like a sprite or any transparent material. It’s as “simple” as rendering to the silhouette texture with alpha instead of as solid geometry. The big caveat is only the brute force methods will work properly with any kind of soft alpha. And even they will work better with a hard edge. Indeed, the JFA can only support hard edges. So you’d need to render to the silhouette texture using alpha test, or better yet a sharpened alpha blend or alpha to coverage, like I detailed in my alpha to coverage article.

The Sobel based JFA initialization in the implementation example expects any kind of anti-aliasing to be no more than a single pixel edge, otherwise the estimation will fail.

Real Numbers

Here’s the raw table of numbers I used to generate the graphs above. I don’t remember if the numbers for the straight brute force are using a stencil for the interior mask or resampling the original silhouette. Optimized brute force and Gaussian blur are as described above. Jump flood numbers are all the same for each “jump” in radius, not because that’s how they actually benchmarked, but because I had rerun this several times as I was tweaking it and started getting lazy. Plus I knew from previous runs the number didn’t change meaningfully within each radius jump, and because the fractional values weren’t visible in the graph.

Left column is the pixel radius. All numbers are in microseconds, not milliseconds.

Code Samples

Here is example code for the final anti-aliased jump flood outline effect.
(direct link to gist.)
