What’s Under The Hood Of AI Face Swap Models


Have you ever swapped your face onto a celebrity’s body with a free ai face swap app and wondered, “Fun, but how does it actually work?” Most people use the software, get a kick out of it, and move on. The technology underneath, though – that’s where it gets interesting. Let’s strip it down.


It All Starts with Data – Mountains of It

Before the engineers can write a line of training code, they need data. Face swap models are hungry. We’re talking tens of millions of images, sometimes more. Different angles, lighting, skin tones, facial expressions and occlusions (that’s a fancy way to say “stuff in front of your face – a hand, a scarf, strange shadows”).

And the quality of the dataset caps the quality of the model. Garbage in, garbage out. It’s that simple.

Here’s an analogy: training a face swap model on nothing but blurry photos is like teaching someone to paint using only blurry reference images. They will produce something, but you probably won’t want to buy it.

The Generative Adversarial Network: The Odd Couple

Most ai face swap models are built on something called a GAN – a Generative Adversarial Network. Two neural networks, a generator and a discriminator, locked in a constant argument over whether the generator’s output looks real.

The generator says, “Look, this face swap is real!”

The discriminator says, “No it’s not, the jawline is wrong.”

And so they go, round after round, across millions of training iterations, until the generator is good enough to fool the discriminator most of the time. The result? Convincingly blended faces.

This training requires serious gear – racks of expensive GPUs running for days or weeks on end. The electricity bill alone is a small fortune.

Landmark Detection: Mapping Out the Face Like a Topographic Map

If we are going to swap faces, we first need to know where the face is. Landmark detection locates dozens of key points – the corners of the eyes, the tip of the nose, the corners of the mouth.


These aren’t approximate guesses. Today’s approaches can detect 68 or more of these with sub-pixel precision. This is what prevents the model from putting an eye where an ear belongs.

Some methods predict heatmaps rather than regressing coordinates directly. Others combine both. The key point is that without this step, nothing else works.
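One common way heatmap-based detectors recover those sub-pixel positions is a soft-argmax: treat the heatmap as a probability map and take the expected coordinate. A minimal sketch (the heatmap here is synthetic; a real detector outputs one heatmap per landmark):

```python
import numpy as np

def soft_argmax(heatmap):
    # Interpret the heatmap as unnormalised log-probabilities, then take
    # the expected (x, y) position -- this yields sub-pixel coordinates,
    # unlike a plain argmax which snaps to whole pixels.
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * xs).sum()), float((p * ys).sum())

# Synthetic heatmap whose peak sits between two pixels in row 30
hm = np.zeros((64, 64))
hm[30, 30] = 20.0
hm[30, 31] = 20.0
x, y = soft_argmax(hm)  # x lands halfway between columns 30 and 31
```

This is also why heatmap methods train well: the expected-position operation is differentiable, so the localisation error can be backpropagated directly.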

3D Modeling: Faces Aren’t Flat (Shocking, Right?)

This is where most of the earlier face swap systems failed. A face isn’t flat. It curves. The nose sticks out. The cheeks pull back. The shape changes when you tilt your head.

Today’s models incorporate 3D Morphable Models, or 3DMMs. These fit a statistical 3D model of face shape, so the swap can account for pose and angle. This is why some of today’s tools look so convincingly real, even when the source is at a radically different pose from the target.
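The 3DMM idea fits in a few lines: a face is the mean shape plus a weighted sum of basis shapes, and head pose is a rotation applied on top. This toy version uses 4 vertices and 2 components purely for readability – real 3DMMs (the Basel Face Model, for instance) use tens of thousands of vertices and on the order of a hundred PCA components.

```python
import numpy as np

# Toy morphable model: face shape = mean + sum(alpha_i * basis_i)
mean_shape = np.array([[0.0, 1.0, 0.2],    # forehead
                       [0.0, 0.0, 0.5],    # nose tip (sticks out in z)
                       [-0.7, 0.0, 0.0],   # left cheek
                       [0.7, 0.0, 0.0]])   # right cheek
basis = np.array([
    [[0, 0, 0], [0, 0, 0.3], [0, 0, 0], [0, 0, 0]],     # longer nose
    [[0, 0, 0], [0, 0, 0], [-0.2, 0, 0], [0.2, 0, 0]],  # wider cheeks
], dtype=float)

def reconstruct(alphas, yaw_deg=0.0):
    # Blend the basis shapes, then rotate about the vertical axis to
    # model a turned head
    shape = mean_shape + np.tensordot(alphas, basis, axes=1)
    t = np.radians(yaw_deg)
    R = np.array([[np.cos(t), 0.0, np.sin(t)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(t), 0.0, np.cos(t)]])
    return shape @ R.T

neutral = reconstruct(np.zeros(2))            # just the mean face
turned = reconstruct(np.zeros(2), yaw_deg=45)  # same face, head turned
```

Because shape coefficients and pose are separate parameters, a swap can keep the target’s pose while substituting the source’s shape coefficients – exactly the decoupling flat 2D methods lack.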

If you don’t account for 3D, you get what is sometimes referred to as a “cutout” look – as if the face was pasted onto the body.

Blending and Refinement: The Final Polish

Getting the geometry right is only half the battle. The skin tones have to match. The lighting needs to be correct. The edges can’t look like they were cut out with a pair of scissors.

The blending is expensive to do properly, and you can tell when it’s not.

This is where post-processing comes in. Some pipelines include refinement networks trained specifically to correct color, edges and shadows. Others fold that work into the main generation pass.

Many of the face swap free tools you can find today do this on the cheap, and it shows: their swaps fall apart in bright sunlight, or when the source and target faces have different skin tones.
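The simplest blending trick is worth seeing in code: feather the face mask so the seam fades instead of cutting. This is a minimal sketch of that one idea – real pipelines go much further, with colour transfer, Poisson blending, or learned refinement.

```python
import numpy as np

def box_blur(mask, k=5):
    # Separable box blur: average down columns, then along rows
    kernel = np.ones(k) / k
    out = np.apply_along_axis(lambda v: np.convolve(v, kernel, "same"), 0, mask)
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, "same"), 1, out)

def feathered_blend(target, source, mask, feather=5):
    # Soften the hard 0/1 face mask, then alpha-blend. Near the seam the
    # output is a mixture of both images, so there is no scissor-cut edge.
    alpha = box_blur(mask.astype(float), feather)
    return alpha * source + (1.0 - alpha) * target

# Toy grayscale images: dark target, bright source, square face mask
target = np.zeros((32, 32))
source = np.ones((32, 32))
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0
out = feathered_blend(target, source, mask)
# Deep inside the mask the result is pure source; at the mask edge it is
# a gradual mix; far outside it is pure target
```

Feathering alone fixes the edge but not the colour mismatch – which is exactly the failure mode you see in the cheap tools.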

Encoder-Decoder Networks: What Makes a Face a Face?

Besides GANs, another popular architecture is the encoder-decoder. The encoder compresses an image into a compact code – a kind of fingerprint. The decoder then turns that fingerprint back into a face.

The classic trick: train one shared encoder on photos of two different people, but give each person their own decoder. To swap, encode a photo of person A, then decode it with person B’s decoder. Out comes a face with A’s pose and expression but B’s identity.

It sounds so neat. In practice, it’s genuinely difficult to keep the two identities from leaking into each other – get it wrong and the decoder blends them into some grotesque, unidentifiable hybrid.
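The wiring of that shared-encoder scheme can be sketched with plain linear maps. The weights here are random placeholders – a real system learns convolutional weights from thousands of photos of each person – so this only shows how the pieces connect, not a working swap.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 32
PIXELS = 64 * 64  # flattened 64x64 grayscale face

# One shared encoder; one decoder per identity (weights are random
# placeholders standing in for learned networks)
W_enc = rng.standard_normal((LATENT, PIXELS)) * 0.01
W_dec_a = rng.standard_normal((PIXELS, LATENT)) * 0.01
W_dec_b = rng.standard_normal((PIXELS, LATENT)) * 0.01

def encode(face):
    # Compress the face to a compact code capturing pose/expression
    return np.tanh(W_enc @ face)

def decode(latent, W_dec):
    # Render a face image from the code with one identity's decoder
    return W_dec @ latent

face_of_a = rng.standard_normal(PIXELS)
latent = encode(face_of_a)            # A's pose and expression
swapped = decode(latent, W_dec_b)     # ...rendered as person B
reconstructed = decode(latent, W_dec_a)  # ...rendered back as person A
```

The swap is literally just routing: same code, different decoder. All the difficulty lives in training the encoder to capture pose and expression without smuggling identity into the code.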

Loss Functions: How to Tell a Good Output

Have you ever heard of loss functions? They quantify how bad the model’s output is. The lower the score, the better.

Training for face swap usually involves multiple losses:

A pixel loss compares the generated and target images pixel by pixel. A perceptual loss checks that their feature representations – the way a deep network “sees” them – are similar. An identity loss verifies that the swapped face still resembles the source person. An adversarial loss comes from that discriminator network we mentioned above.

Balancing these four losses is a real challenge. Weight the identity loss too heavily and the output is a rigid copy of the source face; weight it too lightly and the result becomes an unrecognisable mush.
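A sketch of how the four terms might be combined into one training objective. The weights, argument names and tensor shapes here are all made up for illustration – every published model tunes its own.

```python
import numpy as np

def total_loss(gen, target, gen_feat, tgt_feat, gen_id, src_id, d_score,
               w_pix=1.0, w_perc=0.1, w_id=0.5, w_adv=0.05):
    # Pixel loss: mean squared error between the raw images
    pixel = np.mean((gen - target) ** 2)
    # Perceptual loss: MSE between deep feature representations
    perceptual = np.mean((gen_feat - tgt_feat) ** 2)
    # Identity loss: cosine distance between face-recognition embeddings
    identity = 1.0 - np.dot(gen_id, src_id) / (
        np.linalg.norm(gen_id) * np.linalg.norm(src_id))
    # Adversarial loss: low when the discriminator scores the fake "real"
    adversarial = -np.log(d_score + 1e-8)
    return (w_pix * pixel + w_perc * perceptual
            + w_id * identity + w_adv * adversarial)

# A perfect output (identical images, matching embeddings, fooled
# discriminator) scores near zero; a bad one scores much higher
img, feat, emb = np.ones((4, 4)), np.ones(8), np.ones(16)
perfect = total_loss(img, img, feat, feat, emb, emb, d_score=0.99)
worse = total_loss(img, np.zeros((4, 4)), feat, -feat, emb, emb, d_score=0.01)
```

Changing the `w_*` weights is precisely the balancing act described above: each weight trades one failure mode against another.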

Optimising for Speed: Fitting It on Your Phone

Most of the training described so far happens on server farms. The trick is to distill that knowledge into something that can run on a mobile phone.


Knowledge distillation (training a small network to mimic a large one), quantization (lowering the numerical precision of the weights) and pruning (removing unnecessary connections in the network) all contribute. You’re trying to push a watermelon through a garden hose without losing too much juice.
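Quantization is the easiest of the three to show. A minimal symmetric int8 scheme: store each weight as a signed byte plus one shared scale factor, quartering the memory of 32-bit floats. (Real deployments typically use per-channel scales and calibration data; this is just the core idea.)

```python
import numpy as np

def quantize_int8(w):
    # Symmetric linear quantization: one scale maps floats to [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int8 codes
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Each weight is now 1 byte instead of 4; the reconstruction error per
# weight is bounded by half a quantization step (s / 2)
```

The bounded rounding error is why quantization usually costs only a sliver of accuracy while making the model small and fast enough for a phone.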

That some apps manage this in close to real time on mid-tier phones is, frankly, impressive.

The difference between “it puts a face on another face” and “it does so in a way that tricks a trained human eye” is years of work in computer vision, deep learning, 3D modelling, and signal processing. Every piece has to work. And when it does, it feels slightly uncanny – which, come to think of it, is another way to describe a photograph that looks real but isn’t.