The strange magic of turning noise into images
When you type a sentence into an image generator powered by AI and ask it to turn that sentence into a painting, something genuinely strange happens behind the scenes. You write a handful of words. A second later: a wolf on a neon-lit Tokyo rooftop, or a Victorian portrait of a cat with a latte. It feels like magic. It is not. It is math doing something deeply counterintuitive: learning to reverse chaos.

Nearly all popular image-synthesis tools of the past few years are built on diffusion models. They work differently from older generative systems: there is no pair of networks locked in competition, no bottleneck squeezing meaning into a tiny latent code. Instead, diffusion takes a long, slow path, and that patience is exactly what makes it powerful.
Let’s dig in.
What diffusion means
The name comes from physics. Drop a dye tablet into still water and watch it begin to spread: that gradual decay of structure into scatter is diffusion. Its mathematics has been understood for more than a hundred years.
The same concept drives a diffusion model for images. Take a photograph. Add a small amount of Gaussian noise. Add more. Keep adding until you can no longer see the picture, only a uniform grey field of static. Do this over roughly a thousand steps and you have completed the forward process. The image has diffused away into nothing.
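The forward process has a convenient closed form: instead of adding noise a thousand separate times, you can jump straight to any noise level in a single step. A minimal NumPy sketch, assuming the linear noise schedule used in the original DDPM setup (the array shapes and the stand-in "image" are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(image, t, betas):
    """Jump straight to noise level t via the closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    where alpha_bar_t is the cumulative product of (1 - beta)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    noise = rng.standard_normal(image.shape)
    return np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * noise

# A linear schedule over 1000 steps, as in the original DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)

x0 = rng.uniform(-1, 1, size=(64, 64, 3))   # stand-in "image" in [-1, 1]
x_early = forward_diffuse(x0, 10, betas)    # still mostly the image
x_late = forward_diffuse(x0, T - 1, betas)  # essentially pure static
```

At step 10 the result is still strongly correlated with the original; by step 999 the correlation is effectively zero, which is exactly the "diffused away into nothing" endpoint described above.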
The trick is that a neural network is trained to reverse this process. Given the noisy image at step t, it predicts the slightly less noisy image at step t - 1. Chain those predictions together, starting from pure noise, and you come out the other end with a picture that makes sense. The network never sees the original image during inference; it recreates something plausible from the statistical patterns it learned during training.
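The sampling loop itself is short; all the difficulty lives in the trained network. A sketch of a DDPM-style reverse loop, with the network replaced by a clearly labeled stand-in (a real model would call a trained U-Net where `predict_noise` appears):

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t):
    # Stand-in for the trained network. A real model runs a U-Net here
    # that estimates the noise present in x_t; returning zeros keeps the
    # loop runnable but produces no meaningful image.
    return np.zeros_like(x_t)

def reverse_step(x_t, t):
    """One denoising step: estimate the noise and partially remove it,
    mapping x_t to x_{t-1}. Fresh noise is re-injected at every step
    except the last."""
    eps = predict_noise(x_t, t)
    x_prev = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        x_prev += np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return x_prev

x = rng.standard_normal((64, 64, 3))  # start from pure Gaussian noise
for t in reversed(range(T)):          # walk the chain back: t = 999 ... 0
    x = reverse_step(x, t)
```

The structure is the point: a thousand small corrections, each conditioned on the current noise level, rather than one giant leap from static to picture.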
That’s it. That is the core idea. The rest is engineering, scale, and some very clever additions.
How text enters the picture
A bare diffusion model that generates random images is interesting but not particularly useful. The breakthrough that turned these models into instruments was conditioning: feeding the model extra information that tells it what to produce.
Text conditioning works through a mechanism called cross-attention. The prompt is first passed through a text encoder (typically a variant of CLIP, or a transformer trained on paired text and images) that converts it into a sequence of embedding vectors. These embeddings capture meaning: not just keywords, but relationships and context.
Inside the U-Net, at various layers, cross-attention modules compare the image features at every spatial location with the text embeddings. Positions relevant to a concept get pulled toward it. At each denoising step, this nudges the generation a little further toward what you described.
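Stripped to its core, cross-attention is a few matrix multiplications: the image features supply the queries, the text embeddings supply the keys and values. A toy NumPy sketch with made-up dimensions; in a real model the projection matrices `W_q`, `W_k`, `W_v` are learned, not random:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_embeds, W_q, W_k, W_v):
    """Scaled dot-product cross-attention.

    image_feats: (num_pixels, d_img) -- queries come from the image
    text_embeds: (num_tokens, d_txt) -- keys/values come from the prompt
    """
    Q = image_feats @ W_q                     # (num_pixels, d)
    K = text_embeds @ W_k                     # (num_tokens, d)
    V = text_embeds @ W_v                     # (num_tokens, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # each pixel scores each token
    weights = softmax(scores, axis=-1)        # attention over prompt tokens
    return weights @ V                        # pixels pull in token content

rng = np.random.default_rng(2)
d = 32
img = rng.standard_normal((64 * 64, 48))  # flattened 64x64 feature map
txt = rng.standard_normal((7, 24))        # embeddings for a 7-token prompt
out = cross_attention(img, txt,
                      rng.standard_normal((48, d)),
                      rng.standard_normal((24, d)),
                      rng.standard_normal((24, d)))
```

Every spatial position ends up with a weighted mixture of the prompt tokens, which is the mechanical sense in which "relevant positions are pulled toward the relevant concept."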

Think of it as navigation. You start at a random point in image space. Your text prompt is the destination. At every step, the model compares where it is with where the prompt says it should be, and adjusts. The more specific the prompt, the narrower the path.
Latent diffusion: the performance gain
Running the diffusion process on full-resolution images is painfully slow. Every denoising step is a forward pass through a large neural network. Run hundreds of those passes for a single 512×512 image and you burn a great deal of compute for one result.
Latent diffusion moves the whole procedure into a compressed representation. A separate autoencoder is trained to encode images into a much smaller latent space, usually 8 or 16 times smaller in each spatial dimension. The diffusion process then runs in that compressed space, and only at the very end does the decoder reconstruct the latent into a full image.
This is what makes real-time or near-real-time generation possible on consumer hardware. The diffusion model never has to process full-resolution pixels. It shuffles comparatively small latent tensors, and the decoder does the hard work of restoring visual detail.
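The savings are easy to quantify. Assuming the common configuration of 8× spatial downsampling into a 4-channel latent (the numbers Stable Diffusion uses), the arithmetic looks like this:

```python
# Rough data comparison: pixel-space vs. latent-space diffusion for a
# 512x512 RGB image, assuming 8x spatial downsampling into a
# 4-channel latent (the Stable Diffusion configuration).
pixel_elements = 512 * 512 * 3                 # 786,432 values per pass
latent_elements = (512 // 8) * (512 // 8) * 4  # 64x64x4 = 16,384 values

ratio = pixel_elements / latent_elements
print(ratio)  # 48.0 -- each denoising step touches ~48x less data
```

And since that reduction applies to every one of the hundreds of denoising steps, the total saving compounds across the whole sampling loop.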
The tradeoff is that the autoencoder can occasionally introduce artifacts or slightly blur fine textures. But the speedup is so dramatic that this architecture, introduced in what the research community calls latent diffusion models (LDMs), became the foundation of most modern tools almost immediately after the first papers.
The ethical layer: what does not get generated
Diffusion models are trained on data, and if the training data contains a kind of image, the model can probably produce it. That posed an immediate dilemma for public-facing tools: how do you release a model trained on a barely filtered slice of the internet without it producing genuinely harmful output?
Most commercial tools answer with a combination of fine-tuning and filtering. During training, the model is steered away from problematic content (via RLHF or similar techniques), and prompt filters intercept requests that trip an obvious wire. How strict those restrictions are varies widely between services.
There are also open-source variants that are deliberately unrestrained, sometimes described as uncensored image generators, while commercial vendors apply very different levels of filtering. Neither approach is simply correct. Unrestricted tools allow creative freedom and exploration that constrained tools prevent; constrained tools reduce harm in deployment environments that have no moderation infrastructure. It is an actual tradeoff, not a mere policy failure on either side.

Fine-tuning also produces domain-specific models. Take an existing base diffusion model and train it further on a few hundred images of a particular artist's style, a specific product, or a narrow visual vocabulary. The model does not forget what it already knows; it simply gains a strong new attractor in a particular direction. LoRA (Low-Rank Adaptation) makes this even cheaper by inserting small weight updates without retraining the entire network.
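The LoRA idea fits in a few lines: freeze the base weight matrix W and learn a correction expressed as the product of two thin matrices, B and A. A sketch with illustrative sizes (real adapters typically use ranks between 4 and 64):

```python
import numpy as np

rng = np.random.default_rng(3)

d_out, d_in, rank = 256, 256, 4   # illustrative sizes

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # small trainable matrix
B = np.zeros((d_out, rank))                   # B starts at zero, so the
                                              # adapter initially does nothing

def adapted_forward(x):
    # Base output plus the low-rank correction: W x + B (A x).
    # Only A and B are updated during fine-tuning; W never changes.
    return W @ x + B @ (A @ x)

# Storage comparison: full fine-tune vs. LoRA adapter
full_params = W.size            # 256 * 256 = 65,536
lora_params = A.size + B.size   # 2 * 256 * 4 =  2,048
```

Here the adapter holds 32× fewer parameters than the layer it modifies, which is why a style fine-tune can ship as a few megabytes instead of a full model checkpoint.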
