Distinctive Derivative Differences

Pesky Problems with Procedural UVs

In my article on Sphere Impostors I talked a bit about two different issues that come up when using procedural UVs calculated in the fragment shader, and showed solutions for both. Unfortunately it seems one of those solutions has an Achilles’ heel.

This article is about some alternative solutions I knew of, and one I didn’t until I was writing this article. And how the one I didn’t know taught me about stuff I didn’t know I didn’t know.

Note: This article is written with Unity in mind and the presented example code is written in HLSL. But the techniques discussed will work in any shader language, and indeed most of this article assumes the HLSL is being transpiled for use with graphics APIs other than Direct3D.

The “Solution”

So what solution am I talking about? This one: “Seamless cylindrical and toroidal UV-mapping”, by Marco Tarini.

The basic idea of the paper was to use two separate UV sets to avoid needing to add a vertex seam on meshes. Honestly, for the specific use case it was proposed for, it wasn’t that useful. Modern GPUs are not especially worried about minor increases in vertex counts. And increasing the cost of all vertices to avoid a modest increase in vertex count was of questionable usefulness even when the paper was released. However it’s proven invaluable as a technique for fixing seams on purely procedural UVs. At least it has been invaluable for me for roughly the last decade.

As a recap, the problem case I use this technique to fix is when using atan2() to generate either spherical or angular UVs inside of a shader.

Something like this:

float3 normal = normalize(i.normal);
float2 uv = float2(
// atan returns a value between -pi and pi
// so we divide by pi * 2 to get -0.5 to 0.5
atan2(normal.z, normal.x) / (UNITY_PI * 2.0),
// acos returns 0.0 at the top, pi at the bottom
// so we flip the y to align with Unity's OpenGL style
// texture UVs so 0.0 is at the bottom
acos(-normal.y) / UNITY_PI
);
fixed4 col = tex2D(_MainTex, uv);

These produce the UVs for sampling an equirectangular texture. But using UVs calculated like this has a problem. At the point where the UVs wrap, there’s a visible seam.
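For a concrete number check, here’s the same UV math as plain Python (a CPU-side sketch purely for illustration, not shader code):

```python
import math

def equirect_uv(normal):
    # maps a unit direction to equirectangular UVs,
    # mirroring the shader math above
    x, y, z = normal
    # atan2 returns -pi to pi, so dividing by 2*pi gives -0.5 to 0.5
    u = math.atan2(z, x) / (2.0 * math.pi)
    # acos(-y) is 0 at the bottom (y = -1) and pi at the top (y = +1),
    # so v runs 0 to 1 from bottom to top, OpenGL style
    v = math.acos(-y) / math.pi
    return (u, v)

# facing +x: the u = 0 column, on the equator
print(equirect_uv((1.0, 0.0, 0.0)))  # (0.0, 0.5)
# facing +z: a quarter turn around, still on the equator
print(equirect_uv((0.0, 0.0, 1.0)))  # (0.25, 0.5)
```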

I’m going to be lazy and quote myself to explain why:

GPUs calculate the mip level by what are known as screen space partial derivatives. Roughly speaking, this is the amount a value changes from one pixel to the one next to it, either above or below. GPUs can calculate this value for each set of 2x2 pixels, so the mip level is determined by how much the UVs change within each of these 2x2 “pixel quads”. And when we’re calculating the UVs here, the atan2() suddenly jumps from roughly 0.5 to roughly -0.5 between two pixels. That makes the GPU think the entire texture is being displayed between those two pixels. And thus it uses the absolute smallest mip map it has in response.

And the proposed solution:

The idea is to use two UV sets with the seams in different places. And for our specific case, the longitudinal UVs calculated by the atan2() are already in a -0.5 to 0.5 range, so all we need is a frac() to get them into a 0.0 to 1.0 range. Then use those same partial derivatives to pick the UV set with the least change. The magical function fwidth() gives us how much the value is changing in any screen space direction.

// atan returns a value between -pi and pi
// so we divide by pi * 2 to get -0.5 to 0.5
float phi = atan2(normal.z, normal.x) / (UNITY_PI * 2.0);
// 0.0 to 1.0 range
float phi_frac = frac(phi);
float2 uv = float2(
// uses a small bias to prefer the first 'UV set'
fwidth(phi) - 0.0001 < fwidth(phi_frac) ? phi : phi_frac,
// acos returns 0.0 at the top, pi at the bottom
// so we flip the y to align with Unity's OpenGL style
// texture UVs so 0.0 is at the bottom
acos(-normal.y) / UNITY_PI
);
fixed4 col = tex2D(_MainTex, uv);

And now we have no more seam!
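Here’s a tiny CPU-side sketch of why that selection works at the wrap point (plain Python with made-up pixel values, not shader code):

```python
import math

def frac(x):
    # HLSL-style frac(): fractional part, always in 0.0 to 1.0
    return x - math.floor(x)

# two neighboring pixels straddling the wrap: phi jumps from just
# under 0.5 to just over -0.5 ...
phi_a, phi_b = 0.499, -0.499
# ... but the frac()'d copies land right next to each other
frac_a, frac_b = frac(phi_a), frac(phi_b)  # 0.499 and 0.501

# 1D stand-ins for fwidth(): how much each set changes across the pair
phi_fw = abs(phi_b - phi_a)    # ~0.998, looks like the whole texture
frac_fw = abs(frac_b - frac_a) # ~0.002, the true rate of change
# Tarini's selection: use whichever set changes the least
u = phi_a if phi_fw - 0.0001 < frac_fw else frac_a
print(u)  # 0.499, the wrapped set wins near the seam
```

Away from the seam the situation reverses: frac() introduces its own jump at phi = 0.0, and the bias makes the unwrapped set win there.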

Pretty great, right? So why am I writing a whole new article on it?

Because it doesn’t work when using OpenGL, Vulkan, or Metal.
At least not with Unity (yet…).

Screen Space Partial Differences

Let’s talk a little bit more about the specifics of screen space partial derivatives. Above I talked about pixel quads, the 2x2 groups of pixels GPUs use when rendering. It’s important to understand it’s a grid of quads across the entire screen. Each quad knows nothing about the quad next to it. When a GPU is running the fragment shader for one on-screen pixel, it is always running the fragment shader on all four pixels of the quad in parallel. This is true even if a triangle is only visible in one of those pixels! Because the GPU is always running all four pixels of the quad in parallel, it can look at the current value from all four to calculate the difference.

This all exists explicitly to support mip mapping. If you know how much the UVs have changed in the pixel quad, and you know the resolution of the texture being sampled (which the GPU hardware texture unit does), then you can calculate the appropriate mip level to display. Using a bit of math like this:

// OpenGL's reference mip map level calculation converted to HLSL
// texture_coord = uv * texture resolution
float CalcMipLevel(float2 texture_coord)
{
float2 dx = ddx(texture_coord);
float2 dy = ddy(texture_coord);
// get the max squared magnitude of the change along x and y
float delta_max_sqr = max(dot(dx, dx), dot(dy, dy));
// equivalent to log2(sqrt(delta_max_sqr))
return max(0.0, 0.5 * log2(delta_max_sqr));
}
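To sanity check the math, here’s a direct Python port of that reference calculation (CPU-side only, for illustration):

```python
import math

def calc_mip_level(dx, dy):
    # dx, dy: per-pixel change of the texture coordinates,
    # already scaled by the texture resolution (i.e. in texels)
    delta_max_sqr = max(dx[0] * dx[0] + dx[1] * dx[1],
                        dy[0] * dy[0] + dy[1] * dy[1])
    # 0.5 * log2(x) is the same as log2(sqrt(x))
    return max(0.0, 0.5 * math.log2(delta_max_sqr))

# UVs stepping 1 texel per pixel -> full resolution mip 0
print(calc_mip_level((1.0, 0.0), (0.0, 1.0)))  # 0.0
# UVs stepping 4 texels per pixel -> mip 2, which is 1/4 resolution
print(calc_mip_level((4.0, 0.0), (0.0, 4.0)))  # 2.0
```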

It’s also important to know that the partial derivatives are calculated as the same value for all pixels within the quad. It’s always the difference of the value between the “first” pixel in the pixel quad and the pixel to the right or the pixel below. Some of you may have noticed this means the bottom right pixel’s value is ignored. We’ll get to that in a bit.

The derivatives being constant for the entire pixel quad means the calculated mip level also is. This is an important optimization for the GPU, as it guarantees the entire pixel quad can reuse the same mip map texture data fetched from main memory.

Originally the partial derivatives were only being used for texture mip mapping. The derivatives for mip mapping were calculated by the hardware texture units from the UVs passed to them by the tex2D() function calls within the pixel quad, and still are. But eventually it was realized that partial derivatives were useful for other things too, so the ddx() and ddy() functions were added to shaders. Like those used for mip mapping, they returned values that are constant for the entire pixel quad. By extension this meant the value returned by fwidth(), a function that adds the absolutes of both ddx() and ddy() together, is constant for the entire pixel quad.

And that gets us to where we were before. Because the derivatives in the shader and the derivatives used for mip mapping match, we can use fwidth() to find out which UV set changes the least in the pixel quad and use that one for the texture sample to avoid the seam artifact.
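As a CPU-side sketch of the behavior described above (plain Python with made-up values, not shader code), coarse derivatives for a whole quad come from just three of its four pixels, and fwidth() is simply the sum of their absolute values:

```python
def coarse_derivs(quad):
    # quad[row][col] holds one 2x2 pixel quad's values,
    # with quad[0][0] as the "first" pixel (top left in D3D)
    ddx = quad[0][1] - quad[0][0]  # pixel to the right minus first
    ddy = quad[1][0] - quad[0][0]  # pixel below minus first
    # the same pair is returned for all four pixels,
    # and quad[1][1] (the bottom right pixel) is never read
    return ddx, ddy

def fwidth(quad):
    # fwidth() adds the absolutes of ddx() and ddy() together
    ddx, ddy = coarse_derivs(quad)
    return abs(ddx) + abs(ddy)

quad = [[0.5, 0.75],
        [0.5625, 0.99]]  # the 0.99 has no effect on the result
print(coarse_derivs(quad), fwidth(quad))  # (0.25, 0.0625) 0.3125
```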

A Coarse Look at a Fine Problem

At some point people writing shaders and building GPUs realized it would be really handy to have more accurate partial derivatives. Having a constant value for the entire pixel quad wasn’t always the best looking option, and the GPU has the data to do it more accurately, at least within a single quad. So about 11 years ago with the release of DirectX 11, they added ddx_fine() and ddy_fine() to HLSL. Unlike the ddx() and ddy() of old, these would give you the partial derivative between the two horizontal or vertical pixels of whatever row or column they were called in. This meant the values wouldn’t be constant for the entire pixel quad anymore. They also added ddx_coarse() and ddy_coarse() which use the older behavior of calculating derivatives, and the original derivative functions are aliases for the coarse versions.

Note, texture mip mapping also kept using coarse derivatives. Remember, it’s still a useful optimization to have the same values for the entire quad, and the visual difference is negligible to non-existent for traditional UVs, so it wasn’t worth changing.
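The difference between the two accuracies is easy to see on the CPU. In this sketch (plain Python with made-up values, not shader code) coarse derivatives hand every pixel the first row’s difference, while fine derivatives give each row its own:

```python
def coarse_ddx(quad):
    # coarse: every pixel in the quad reuses the first row's difference
    return quad[0][1] - quad[0][0]

def fine_ddx(quad, row):
    # fine: each row gets its own horizontal difference
    return quad[row][1] - quad[row][0]

# a quad whose value changes faster along the bottom row
quad = [[0.0, 0.25],
        [0.0, 0.75]]

print(coarse_ddx(quad))   # 0.25 reported for all four pixels
print(fine_ddx(quad, 0))  # 0.25 for the top row ...
print(fine_ddx(quad, 1))  # ... but 0.75 for the bottom row
```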

But here’s where the problem comes in.

DirectX’s spec requires that the base ddx() and ddy() functions default to coarse behavior.

But OpenGL’s spec leaves the choice up to the device.

And it seems a lot of GPUs started to default the equivalent GLSL derivative functions to the fine behavior. Worse, GLSL didn’t get separate coarse and fine derivative functions until nearly 6 years after Direct3D, with OpenGL 4.5. This was around the time everyone had their eyes on Mantle, and shortly after, Vulkan. Most people had stopped really looking at OpenGL for future facing stuff. This was compounded by macOS, the one major platform that still used desktop OpenGL (sorry Linux), being artificially stuck on OpenGL 4.2. So it never got support for those explicit accuracy derivative functions.

This means lots of GPUs had the capability of doing either fine or coarse derivatives. And really, most of the time the fine accuracy derivatives are much better to use than the coarse ones. So many of those GPUs have chosen to default to fine derivatives in OpenGL. Certainly most modern desktop GPUs seem to.

And again, the texture mip mapping is still using coarse derivatives.

This means the derivatives the shader has access to do not match those used for mip mapping. And there are cases where those in-shader fine derivatives won’t show a UV discontinuity that the texture mip mapping’s coarse derivatives do, because they are not constant for the entire pixel quad. Thus Tarini’s original method no longer works on some platforms and the seam reappears.

Newer APIs & Unity

So, the obvious solution would be: let’s use the newer APIs. If OpenGL 4.5 supports explicit accuracy derivative functions, why don’t we at least use those for Linux? And Vulkan and Metal must have them too, right?

Sort of.

The problem is Unity doesn’t actually seem to be capable of producing GLSL 4.5 code; it’s capped out at version 4.3, so any GLSL features added in 4.4 or 4.5 can only be used if you hand write the GLSL code for them. This means if you add ddx_coarse() or ddx_fine() to an HLSL shader, Unity converts them both to the base ddx() equivalent (dFdx()) in the GLSL. And I don’t expect that to change since there’s not a lot of reason to use OpenGL anymore over Vulkan or Metal on the platforms that traditionally used it and support newer versions of OpenGL. OpenGL ES doesn’t even have explicit accuracy functions at all. Both the OpenGL and OpenGL ES specs do include ways to request a certain level of accuracy for derivatives, but the specs also imply that actually doing anything with that request is entirely optional.

What about Vulkan? Like OpenGL, the behavior of the base ddx() equivalent is up to the device, and like OpenGL it seems many have chosen to default to fine accuracy derivatives. Luckily Vulkan does have explicit accuracy derivative functions. So this should be a non-issue, right? Unfortunately Unity converts both ddx_coarse() and ddx_fine() to the base ddx() equivalent for Vulkan too. Hopefully they’ll fix this one, but for now there’s no way to force coarse derivatives when using Vulkan.

Lastly there’s Metal. Metal does not have coarse derivatives at all. Well, the texture unit still does coarse derivatives for mip mapping*. But the shader language does not have coarse derivative functions. In the spec the base derivative functions are described as “high precision” which would suggest they’re always using the equivalent of the fine accuracy derivatives in the other APIs. And indeed they are.

* We’ll swing back to that asterisk later.

Everything Is Fine

The good news is there are solutions that work universally and don’t depend on the shader derivative functions matching the hardware texture unit derivatives. The below options all work regardless of if the default derivative functions use coarse or fine accuracy.

Explicit LOD

Tarini’s original method made use of tex2D(). That sampling function relies on derivatives calculated by the hardware texture unit to calculate the mip level. What if we calculate the mip level manually in the shader instead? We can do that! I even posted the reference mip level function above! And there’s a texture sampling function that lets us pass an explicit mip level, or “LOD”, to use instead of relying on the hardware derivatives: tex2Dlod().

// explicit LOD example
// atan returns a value between -pi and pi
// so we divide by pi * 2 to get -0.5 to 0.5
float phi = atan2(normal.z, normal.x) / (UNITY_PI * 2.0);
// 0.0 to 1.0 range
float phi_frac = frac(phi);
// acos returns 0.0 at the top, pi at the bottom
// so we flip the y to align with Unity's OpenGL style
// texture UVs so 0.0 is at the bottom
float theta = acos(-normal.y) / UNITY_PI;
// construct the primary uv
float2 uvA = float2(phi, theta);
// construct the secondary uv using phi_frac
float2 uvB = float2(phi_frac, theta);
// get the min mip level of either uv sets
// _TextureName_TexelSize.zw is the texture resolution
float mipLevel = min(
CalcMipLevel(uvA * _MainTex_TexelSize.zw),
CalcMipLevel(uvB * _MainTex_TexelSize.zw)
);
// sample texture with explicit mip level
// the z component is 0.0 because it does nothing
fixed4 col = tex2Dlod(_MainTex, float4(uvA, 0.0, mipLevel));

Because we’re calculating the mip level in the shader, it doesn’t matter what accuracy the default derivative functions use since it no longer needs to match the hardware texture unit’s derivatives. So this will work regardless of if we’re getting fine or coarse derivatives.

However there are a few disadvantages to be aware of. One is that this is a more expensive shader. Not significantly so, but it is a few more instructions than the original. And tex2Dlod() can sometimes be slower than tex2D(). It should also be noted that the reference mip level calculation does not actually match any modern GPU’s hardware mip level calculation. Every GPU out there makes some kind of modification or optimization to that reference, like using lower precision math, or using an average of the derivatives, etc. How exactly they calculate it generally isn’t public. And it also probably doesn’t matter. It’s just something useful to know about and totally won’t become something important later on in this article.

Calculating the mip level manually also means any mip bias that might be set on the texture’s sampler state will be ignored. This is more common than most people might realize, as engines that rely on temporal anti-aliasing will usually set a global mip bias to keep textures from getting too blurry.

For me the bigger concern is that anisotropic filtering is disabled. This is because anisotropic filtering relies on having derivatives to calculate more than just a single mip level. It also calculates the anisotropic direction and magnitude. By providing an explicit mip level, the GPU is being told to assume there are no valid derivatives. The lack of anisotropic filtering is especially notable for this specific case of equirectangular UVs, as at the poles the texture is pinched to a point, potentially causing yet another mip mapping artifact.

On the other hand, this will be faster than Tarini’s method if you’re using a texture with anisotropic filtering enabled. But that’s cheating for this use case. It also means it’s faster than doing no seam fixup at that point. However if you have anisotropic filtering disabled, this will be ever so slightly slower than Tarini’s method. But we’re talking a less than 1% difference in my tests.

Explicit Gradients

Luckily, for as long as fine accuracy derivatives have existed, so has the ability to provide explicit derivatives. For HLSL this is in the form of tex2Dgrad(), so named because gradient is another term for derivative. So, like the explicit mip level option, we calculate our own derivatives and tell the GPU to use those.

// explicit gradients example
// atan returns a value between -pi and pi
// so we divide by pi * 2 to get -0.5 to 0.5
float phi = atan2(normal.z, normal.x) / (UNITY_PI * 2.0);
// 0.0 to 1.0 range
float phi_frac = frac(phi);
// acos returns 0.0 at the top, pi at the bottom
// so we flip the y to align with Unity's OpenGL style
// texture UVs so 0.0 is at the bottom
float theta = acos(-normal.y) / UNITY_PI;
// construct uv without doing anything special
float2 uv = float2(phi, theta);
// get derivatives for phi and phi_frac
float phi_dx = ddx(phi);
float phi_dy = ddy(phi);
float phi_frac_dx = ddx(phi_frac);
float phi_frac_dy = ddy(phi_frac);
// select the smallest absolute derivatives between phi and phi_frac
float2 dx = float2(
abs(phi_dx) - 0.0001 < abs(phi_frac_dx) ? phi_dx : phi_frac_dx,
ddx(theta)
);
float2 dy = float2(
abs(phi_dy) - 0.0001 < abs(phi_frac_dy) ? phi_dy : phi_frac_dy,
ddy(theta)
);
// sample the texture using our own derivatives
fixed4 col = tex2Dgrad(_MainTex, uv, dx, dy);

Again, this will work regardless of if we’re getting fine or coarse derivatives. Technically we can get slightly more accurate anisotropic filtering if using fine derivatives. But that difference is so minimal that you’d be hard pressed to tell them apart. And even if you could, you’d have a hard time pointing to which is the more accurate one. Like I said before, there’s a reason why GPUs still universally use coarse derivatives for the mip level calculation. It will also work with any mip bias that might be set on the texture sampler state.

This also looks like it should be slower than Tarini’s original method, and it is, but maybe not for the reason you may expect. In terms of raw instruction count this is identical. However tex2Dgrad() is itself usually slower than tex2D() or even tex2Dlod(). Why? Because it’s sending 3 times as much information out of the shader to the hardware texture unit. And because on some GPUs using tex2Dgrad() causes it to make assumptions about whether or not it can reuse the cached mip level across the entire quad. On mobile this can be a very noticeable performance hit. On desktop it certainly can be measurable, but likely not enough to be impactful. With reasonably sized textures, the performance difference is less than 2%. On my 2080 Super with very large uncompressed textures this rises to an over 40% increase in render time for this specific draw call. On AMD GPUs, on the other hand, the performance difference decreases as the texture gets larger, and may even be slightly faster than Tarini’s original method in the large uncompressed texture case. But we’re talking tiny fractions of a millisecond differences here either way.

Coarse Emulation

Honestly, this option is just kind of silly. But I’ve added it here for the fun of it. We can abuse fine derivatives to pass values between the pixels in the quad, which means we can reproduce coarse derivatives in the shader if we only have access to fine derivatives. This is more formally called “in quad communication.” The idea is to use ddx_fine() and ddy_fine() to get the values horizontally & vertically from the current pixel, then call ddy_fine() and ddx_fine() again on those derivatives to get the values for the other row and column. In effect you can get the values for all 4 pixels in the quad and use that to reconstruct the coarse derivatives. As a nice bonus, this works when using coarse derivatives too, since the alternate derivatives you get aren’t any different.

// coarse emulation
// atan returns a value between -pi and pi
// so we divide by pi * 2 to get -0.5 to 0.5
float phi = atan2(normal.z, normal.x) / (UNITY_PI * 2.0);
// 0.0 to 1.0 range
float phi_frac = frac(phi);
// acos returns 0.0 at the top, pi at the bottom
// so we flip the y to align with Unity's OpenGL style
// texture UVs so 0.0 is at the bottom
float theta = acos(-normal.y) / UNITY_PI;
// get derivatives for phi and phi_frac
float phi_dx = ddx(phi);
float phi_dy = ddy(phi);
float phi_frac_dx = ddx(phi_frac);
float phi_frac_dy = ddy(phi_frac);
// get position within quad
int2 pixel_quad_pos = uint2(vpos.xy) % 2;
// get direction within quad
float2 pixel_quad_dir = float2(pixel_quad_pos) * 2.0 - 1.0;
// get derivatives for the "other" pixel column / row in the quad
float phi_dxy = ddx(phi - phi_dy * pixel_quad_dir.y);
float phi_dyx = ddy(phi - phi_dx * pixel_quad_dir.x);
float phi_frac_dxy = ddx(phi_frac - phi_frac_dy * pixel_quad_dir.y);
float phi_frac_dyx = ddy(phi_frac - phi_frac_dx * pixel_quad_dir.x);
// check which column / row in the quad this is and use alternate
// derivatives if it's not the column / row coarse would use
if (pixel_quad_pos.x == 1)
{
phi_dy = phi_dyx;
phi_frac_dy = phi_frac_dyx;
}
if (pixel_quad_pos.y == 1)
{
phi_dx = phi_dxy;
phi_frac_dx = phi_frac_dxy;
}
// fwidth equivalents using the "coarse" derivatives
float phi_fw = abs(phi_dx) + abs(phi_dy);
float phi_frac_fw = abs(phi_frac_dx) + abs(phi_frac_dy);
// construct uvs like Tarini's method
float2 uv = float2(
// uses a small bias to prefer the first 'UV set'
phi_fw - 0.0001 < phi_frac_fw ? phi : phi_frac,
theta);
fixed4 col = tex2D(_MainTex, uv);

This is a lot more code. And a lot more instructions. So of course it’d make sense that this is the slowest option…

Except it’s not!

This is very nearly as fast as Tarini’s original! At least it is on an Nvidia Turing GPU. If I artificially reduce the texture size to something tiny, it’s somewhere between the tex2Dlod() and tex2Dgrad() methods in pure shader execution cost because it is a more complex shader. But on larger textures it’s a huge win over tex2Dgrad(). Without anisotropic filtering enabled on larger textures it even matches the tex2Dlod() method.

I honestly don’t know entirely how to explain why. In previous tests of using in quad communication for other things I didn’t find it was any faster or visually better than the alternatives. Usually slower, or visually worse, and sometimes both. So I was very surprised when I tried it here and not only did it work, it very nearly matched the original in performance and worked regardless of the derivative accuracy, while also matching the quality of the visually best options. I literally wrote the line about it being the slowest before I actually tested it because I was absolutely sure it was going to be just a curiosity and not actually a viable option.

As I only have access to Nvidia GPUs at the moment, I asked for assistance on Twitter. Matt Pettineo, Aras Pranckevičius, and Nick Konstantoglou kindly provided me with some timings from an AMD Radeon RX 460, an Apple M1, and an AMD Radeon RX 6800 XT. Unlike the Turing GPU, it seems that for 4th gen GCN GPUs in quad communication ends up being ever so slightly slower than tex2Dgrad() rather than faster. Both are about a 10% increase from the original Tarini method. But again, these are fractions of a millisecond differences. The RX 6800 XT is similar, with the in quad communication technique being the slowest. More interestingly, tex2Dgrad() can sometimes beat Tarini’s method on the AMD GPUs.

But there is still an issue. It doesn’t work on some mobile GPUs.

Least Worst Quad Derivatives

Remember how I said the derivatives the shader uses need to match the ones used to calculate the mip level? What if I told you not all GPUs use coarse derivatives for the mip level calculation? The shader I presented above makes the assumption that they all do.

But that is in fact not true. This is an assumption I had made up until just a few days ago when Aras mentioned in an offhand comment that the above in quad communication shader still showed some artifacts on his Apple M1 based Mac. I assumed I’d just goofed up the math somewhere, but out of caution I slapped together an ugly little ShaderToy test.

For my Nvidia GPU it ends up looking something like this.

Nvidia 2080 Super

Remember how I said with coarse derivatives the UVs from one pixel in the quad are ignored? This shader is offsetting one pixel’s UVs within the quad to try to expose which pixel is being ignored. Each quadrant of the window offsets a different pixel of the quad, with the center ellipses using unmodified UVs to show what no modification would look like.

If you’re thinking, “Hey wait, it should be the bottom right pixel that’s ignored! That’s showing the top right!” Good memory! ShaderToy uses WebGL, and for the GL family of APIs the vertical is flipped from all other APIs. Don’t worry about it too much for now, just mentally flip that image upside down if you’re thinking in Direct3D. That’s also what Unity does when you’re using any API other than OpenGL, but let’s not go into that right now.

I assumed that when I loaded that shader up on my iPhone and iPad I’d see a different quadrant showing the noise. But no, that’s not what I saw. I saw this.

iPhone SE 2020

None of the quadrants showed the noise. That meant Apple’s GPUs don’t use coarse derivatives for mip level calculation. They appear to use the largest fine derivatives of the entire quad on each axis.

I posted this on Twitter and Firepal noticed that their phone’s Mali GPU also did not show the noise in any of the quadrants. Curiously it also didn’t show a solid grey, but what looked like some mip just above the smallest mip level. My best guess was that it’s doing some kind of average of the derivatives of the entire quad on each axis. Pete Harris confirmed that guess was at least on the right track.

So that’s at least two GPUs that don’t use the coarse derivatives for calculating the mip level. And there may be others. They do all still use a constant mip level for the entire quad though, at least as far as I can tell.

I also lied a little bit when I said the derivatives in the shader need to match those used to calculate the mip level. Really we just need to make sure our guess as to which of the two UV sets is better matches in both. Tarini’s original assumption that eventually failed was that both the derivatives in the shader and those used for the mip level calculation were the same coarse derivatives. Specifically, only the 2 derivatives from the same 3 pixels were being used in both. Fine derivatives broke this assumption by potentially using a different set of derivatives per pixel within the quad than the mip level calculation. And Apple and Mali broke this assumption by using the 4 derivatives from all 4 pixels within the quad for the mip level calculation.

Luckily with in quad communication we are already calculating all 4 derivatives. Previously we were using those to emulate coarse derivatives, but we could instead compare them against each other to find the worst derivatives for the entire pixel quad. And this will work for this case regardless of how the GPU actually calculates the mip level. That means we can write one shader to rule them all!

// least worst quad derivatives
// atan returns a value between -pi and pi
// so we divide by pi * 2 to get -0.5 to 0.5
float phi = atan2(normal.z, normal.x) / (UNITY_PI * 2.0);
// 0.0 to 1.0 range
float phi_frac = frac(phi);
// acos returns 0.0 at the top, pi at the bottom
// so we flip the y to align with Unity's OpenGL style
// texture UVs so 0.0 is at the bottom
float theta = acos(-normal.y) / UNITY_PI;
// get derivatives for phi and phi_frac
float phi_dx = ddx(phi);
float phi_dy = ddy(phi);
float phi_frac_dx = ddx(phi_frac);
float phi_frac_dy = ddy(phi_frac);
// get position within quad
int2 pixel_quad_pos = uint2(vpos.xy) % 2;
// get direction within quad
float2 pixel_quad_dir = float2(pixel_quad_pos) * 2.0 - 1.0;
// get derivatives for the "other" pixel column / row in the quad
float phi_dxy = ddx(phi - phi_dy * pixel_quad_dir.y);
float phi_dyx = ddy(phi - phi_dx * pixel_quad_dir.x);
float phi_frac_dxy = ddx(phi_frac - phi_frac_dy * pixel_quad_dir.y);
float phi_frac_dyx = ddy(phi_frac - phi_frac_dx * pixel_quad_dir.x);
// get the worst derivatives for the entire quad
phi_dx = max(abs(phi_dx), abs(phi_dxy));
phi_dy = max(abs(phi_dy), abs(phi_dyx));
phi_frac_dx = max(abs(phi_frac_dx), abs(phi_frac_dxy));
phi_frac_dy = max(abs(phi_frac_dy), abs(phi_frac_dyx));
// fwidth equivalents using the worst derivatives
float phi_fw = abs(phi_dx) + abs(phi_dy);
float phi_frac_fw = abs(phi_frac_dx) + abs(phi_frac_dy);
// construct uvs like Tarini's method
float2 uv = float2(
// uses a small bias to prefer the first 'UV set'
phi_fw - 0.0001 < phi_frac_fw ? phi : phi_frac,
theta);
fixed4 col = tex2D(_MainTex, uv);

This isn’t significantly different than the previous example, so all of the benefits still apply. Still potentially much faster, or at least not meaningfully slower, than the tex2Dgrad() method. Still supports anisotropic filtering and texture bias. And works regardless of if the shader derivatives are coarse or fine, or if the hardware mip level calculation is coarse or uses the full quad.

In Quad Communication

I feel it’s worth trying to explain the in quad communication technique a bit more. It can be very confusing to wrap your head around exactly what’s going on. It certainly was for me, and I wrote the above shader code.

Communicative Neighbors

Let’s first focus on a single value and the basic derivatives. Here’s a single pixel quad group with a gradient ramp. The numbers below are the value of each pixel, and the index of the pixel within the pixel quad. I’m using Direct3D’s orientation here, so the first pixel, P0, is in the top left. Ignore that Direct3D’s HLSL defaults to coarse derivatives and pretend we’re getting fine derivatives for the rest of this. Feel free to mentally replace all of the ddx() and ddy() with ddx_fine() and ddy_fine().

pixel quad phi values and indices

With fine derivatives, the ddx() on P0 is calculated by subtracting the value of P0 from P1, and ddy() is subtracting P0 from P2. In this case ddx() is 0.165 and ddy() is -0.467. These are also the values those functions would return if using coarse derivatives for the entire quad.

P0 fine derivatives & coarse derivatives of phi

For fine derivatives, each column and row calculates unique derivatives by subtracting the earlier pixel values from the later ones. So the ddx() for P0 and P1 are 0.165, but for P2 and P3 it’s 0.208.

// get derivatives for phi
float phi_dx = ddx(phi);
float phi_dy = ddy(phi);
fine derivatives for each column and row of phi

Normally we only have access to the derivatives for the one pixel each invocation of the fragment shader is working on. P0 doesn’t know the ddx() of P2, P1 doesn’t know the ddy() of P0, etc. So how do we extrapolate from that to get all four derivatives? Well, by applying the derivatives we do know, we can get the values of the two other pixels, subtracting or adding the derivatives to the current value based on which side of the pixel quad the pixel is on. And remember, we’re doing this to all four pixels in sync with each other. Each step we do in one pixel happens to the other three at the same time. In effect we’re mirroring the quad on the horizontal axis for one set of values and on the vertical axis for the other set of values.

// get position within the quad
int2 pixel_quad_pos = int2(vpos.xy) % 2;
// -1.0 or 1.0 value depending on which row or column this is
float2 pixel_quad_dir = float2(pixel_quad_pos) * 2.0 - 1.0;
// get the "other" pixel column / row values in the quad by
// adding or subtracting the derivatives from the current value
float phi_other_x = phi - phi_dx * pixel_quad_dir.x;
float phi_other_y = phi - phi_dy * pixel_quad_dir.y;
Applying derivatives to the current value to get the value of the other column or row. Top represents phi, bottom left is phi_other_x, bottom right is phi_other_y.

Then we get the derivatives again for both those new values, but on the other axis to the one we just added or subtracted on. So ddy() on the values we added or subtracted ddx() from, and ddx() on the values we added or subtracted ddy() from. Because the extrapolated values have mirrored the values within the quad, the new derivatives we get are the same as those for the “other” column and row.

// get derivatives the "other" pixel column / row in the quad
float phi_other_x_dy = ddy(phi_other_x);
float phi_other_y_dx = ddx(phi_other_y);
Getting the derivatives of the mirrored values gives each pixel access to all four derivatives in the quad. Top shows phi_dx & phi_dy derivatives, bottom shows phi_other_x_dy & phi_other_y_dx derivatives.

And thus each pixel now has access to all four derivatives in the quad, and not just the two it normally would. At this point we also happen to have access to the values of three out of the four pixels in the pixel quad, and we could easily extrapolate the fourth. We don’t need that for this use case, but it’s there if you need it.
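As a sanity check, the whole round trip can be emulated on the CPU. This is just a sketch with hypothetical pixel values; the helper functions mimic fine ddx()/ddy() for a single 2x2 quad and aren’t part of any real API:

```python
# CPU emulation of the in-quad communication trick, assuming fine
# derivatives. P0 is top-left, P1 top-right, P2 bottom-left,
# P3 bottom-right. All values are hypothetical.
phi = {(0, 0): 0.000, (1, 0): 0.165,   # P0, P1
       (0, 1): 0.050, (1, 1): 0.258}   # P2, P3

def ddx(vals, x, y):
    # fine ddx(): right pixel minus left pixel of this pixel's row
    return vals[(1, y)] - vals[(0, y)]

def ddy(vals, x, y):
    # fine ddy(): bottom pixel minus top pixel of this pixel's column
    return vals[(x, 1)] - vals[(x, 0)]

# step 1: each pixel's own derivatives
phi_dx = {p: ddx(phi, *p) for p in phi}
phi_dy = {p: ddy(phi, *p) for p in phi}

# step 2: mirror across the quad using the quad position (-1 or +1)
quad_dir = {(x, y): (x * 2 - 1, y * 2 - 1) for (x, y) in phi}
phi_other_x = {p: phi[p] - phi_dx[p] * quad_dir[p][0] for p in phi}
phi_other_y = {p: phi[p] - phi_dy[p] * quad_dir[p][1] for p in phi}

# step 3: take derivatives of the mirrored values on the other axis
phi_other_x_dy = {p: ddy(phi_other_x, *p) for p in phi}
phi_other_y_dx = {p: ddx(phi_other_y, *p) for p in phi}

# every pixel now sees both ddx values and both ddy values of the quad
true_dx = sorted([phi[(1, 0)] - phi[(0, 0)], phi[(1, 1)] - phi[(0, 1)]])
true_dy = sorted([phi[(0, 1)] - phi[(0, 0)], phi[(1, 1)] - phi[(1, 0)]])
for p in phi:
    got_dx = sorted([phi_dx[p], phi_other_y_dx[p]])
    got_dy = sorted([phi_dy[p], phi_other_x_dy[p]])
    assert all(abs(a - b) < 1e-12 for a, b in zip(got_dx, true_dx))
    assert all(abs(a - b) < 1e-12 for a, b in zip(got_dy, true_dy))
```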

I compacted the last two steps into a single line in the original example shader code, and I’m using more terse variable names, but it’s doing the same thing.

Coarse Communication

To get all four derivatives we need access to fine derivatives in the shader. If you only have coarse derivatives, in-quad communication won’t work. The technique relies heavily on the fact that each row and column can have unique derivatives, and that’s not the case for coarse derivatives. If you use the above code with coarse derivatives, the “other” derivatives will always end up matching the starting derivatives (within floating point accuracy limitations), so there’s no real harm in doing it. It’s just a lot of extra code to come out with the two derivatives you started with.
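A quick CPU sketch, again with hypothetical pixel values, shows why the trick collapses with coarse derivatives, since every pixel in the quad shares the same pair of derivatives:

```python
# With coarse derivatives the whole 2x2 quad shares one ddx/ddy
# pair, so the mirrored values just hand back the derivatives we
# started with. Pixel values are hypothetical.
phi = {(0, 0): 0.000, (1, 0): 0.165,   # P0, P1
       (0, 1): 0.050, (1, 1): 0.258}   # P2, P3

# coarse derivatives: the quad uses the top-left pixel's differences
coarse_dx = phi[(1, 0)] - phi[(0, 0)]
coarse_dy = phi[(0, 1)] - phi[(0, 0)]

# mirror across the quad exactly as in the fine-derivative version
quad_dir = {(x, y): (x * 2 - 1, y * 2 - 1) for (x, y) in phi}
phi_other_x = {p: phi[p] - coarse_dx * quad_dir[p][0] for p in phi}

# coarse ddy() of the mirrored values, again taken from column 0
other_x_dy = phi_other_x[(0, 1)] - phi_other_x[(0, 0)]

# the "other" derivative is just the one we already had
assert abs(other_x_dy - coarse_dy) < 1e-12
```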

Luckily this isn’t really an issue since at this point you’ll likely only ever encounter coarse derivatives in Direct3D and for now desktop GPUs seem to still use coarse derivatives for the mip level calculations.

And thankfully you’ll never end up having to run Direct3D on an ARM Mali GPU where you could end up with coarse derivatives in the shader and full quad derivatives for mip level calculations. *nervous laugh*

* update: It turns out Mali defaults to coarse shader derivatives even in OpenGL! Yay.

Conclusion

I’ve presented four alternate approaches to Marco Tarini’s original method for fixing seams caused by procedural atan UV discontinuities. Which one should you choose?

The first two, explicit LOD and explicit gradients, will work universally as they don’t depend on the shader and mip level calculations using matching derivatives. Both have potential drawbacks. Using an explicit LOD disables anisotropic filtering. Using explicit gradients can be a performance hit on some mobile GPUs and is slower than other options on Nvidia GPUs.

Coarse emulation is a neat trick, but ultimately kind of pointless to use. However the least worst quad derivatives approach appears to work nearly universally, and is either much faster or isn’t significantly slower than explicit gradients on the devices I’ve seen numbers for. So I feel comfortable in saying you should probably use that option for everything. Or at least everything not using Direct3D. edit: or Mali!?

update: Or if you want something that works universally, regardless of the accuracy of the derivatives or the mip level calculation, try this technique:

https://github.com/bgolus/EquirectangularSeamCorrection/issues/1#issuecomment-819290690

Also, maybe just use a cube map instead for this specific use case.

Or if you’re trying to render a skybox, I guess you can turn off mip maps. Just don’t do that for any other use case.

Shader Code

Here is a shader that includes all of the methods discussed in the above article, along with a few other settings. (direct link to gist.) This was originally written for the simple test project I used to do timings. The whole test project is available here: https://github.com/bgolus/EquirectangularSeamCorrection

Additional Thoughts

Shader Graph

Yes, all of these can be done in Unity’s Shader Graph. Some easier than others, since Unity doesn’t yet have a Sample Texture 2D Grad node to let you do the equivalent of tex2Dgrad() without a Custom Function node. I’d probably implement it as a Custom Function node anyway, one that takes a normal as an input and outputs a single UV using the Least Worst Quad Derivatives method. Technically there’s nothing stopping someone from reproducing it all with nothing but the built in nodes. It’s just not something I’m going to do.

Performance Numbers

I didn’t post the real numbers anywhere earlier in the article because they kind of don’t matter. The timings we’re talking about are so small that they’re very difficult to accurately profile. When profiling at μs (microsecond) time steps the numbers tend to jump around quite wildly, so take any of these with a grain of salt.

Note, all of the timings here were done using the coarse emulation version of quad communication. However when I profiled the least worst quad derivatives version I could not measure a difference between the two. I would expect it to perform about the same on other GPUs as well, but that is just an assumption.

Nvidia RTX 2080 Super:

  • 1k x 512 DXT1
    none — 25.8 μs
    tarini — 26.3 μs
    lod — 26.5 μs
    gradients — 26.7 μs
    quad comm — 26.6 μs
  • 2k x 1k DXT1
    none — 26.5 μs
    tarini — 26.8 μs
    lod — 27.0 μs
    gradients — 28.2 μs
    quad comm — 27.2 μs
  • 4k x 2k DXT1
    none — 27.1 μs
    tarini — 27.8 μs
    lod — 26.7 μs
    gradients — 36.5 μs
    quad comm — 27.9 μs
  • 8k x 4k RGBA32
    none — 44.0 μs
    tarini — 44.3 μs
    lod — 29.5 μs
    gradients — 63.2 μs
    quad comm — 44.5 μs

AMD RX 460:

  • 2k x 1k DXT1
    tarini — 183.84 μs
    gradients — 196.96 μs
    quad comm — 201.60 μs
  • 8k x 4k RGBA32
    tarini — 354.40 μs
    gradients — 352.16 μs
    quad comm — 354.24 μs

AMD RX 6800XT:

  • 2k x 1k DXT1
    tarini — 12.159 μs
    gradients — 12.621 μs
    quad comm — 18.744 μs
  • 8k x 4k RGBA32
    tarini — 25.603 μs
    gradients — 23.883 μs
    quad comm — 24.330 μs

Apple M1 MacMini:

  • 8k x 4k RGBA32
    none — 338 μs
    tarini — 361 μs
    lod — 418 μs
    gradients — 266 μs
    quad comm — 291 μs

Again, don’t try to read too much into most of these numbers. They weren’t all measured in the same way: some are the entire draw call, some the entire frame, some only the fragment shader. So I wouldn’t try to compare the numbers from different GPUs directly. Differences that are less than +/-10% are more likely to be random noise than anything useful too. They’re all tiny values that you won’t really be able to see in the end framerate. I wanted them just to confirm that there wasn’t some really crazy performance issue with a particular method. Things like finding out explicit gradients aren’t significantly different for non-Nvidia GPUs. And that explicit LOD maybe is for Apple GPUs.

ShaderToy Examples

I already linked to one of the ShaderToy shaders I wrote while working on this article; if you happen to see all of the quadrants turn grey on a GPU other than an Apple or Mali one, I’d love to know. But I also wrote one to test out the least worst quad derivatives in-quad communication method. You can find it here and test it out yourself on different devices.

Note the seam only shows up in some directions for some methods / devices. So let it do the full rotation to make sure it’s really seam free. If you happen to have an ARM based Windows machine, I’d love to know the results.

Further reading

If you’re still fascinated by the deep rabbit hole that is the topic of GPU derivatives, here’s an article by Hans-Kristian on how much more confusing it all gets when you add in divergent flow control … i.e. conditional branches.

Tech Artist & Graphics Programmer lately focused on Unity VR game dev.