AI Image Generator: Why Hands and Faces Continue to Ruin the Illusion

The Uncanny Valley Has an Address, and It Lives in Your Fingers

Ask any AI image generator to create a confident businessman shaking hands over a boardroom table. Go ahead. The suit will be immaculate. The lighting will be cinematic. The background will hum with just enough corporate vagueness. And then you will look at the hands, and suddenly you are staring at a monster from a biology textbook. Seven fingers. A thumb growing out of the palm. Knuckles pointing in directions knuckles do not point.

This isn’t a quirk. It’s a structural problem. And it has sat at the heart of AI image generation for years, annoying designers, amusing bystanders, and quietly embarrassing every engineer who believed the heavy lifting was done.

The Model Sees Forms, Not Figures

Here is what is going on under the hood. These systems do not understand anatomy the way a figure-drawing teacher does. They do not know that a hand has five fingers because evolution settled on that number roughly 375 million years ago. They know that pixels arranged this way tend to appear next to pixels arranged that way, because they have seen millions of pictures in which that was the case.

The problem with hands is that they’re shape-shifters. A hand holding a cup of coffee looks almost nothing like a hand waving goodbye. Fists have virtually nothing in common with open palms. Fingers overlap. They hide behind objects. They foreshorten aggressively when they point at the camera. To the model, each new hand pose is a new puzzle, and it solves that puzzle by averaging over a distribution of training images, not by consulting a skeletal model.

Faces are more forgiving, but only up to a point.
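A toy sketch can make the averaging point concrete. It assumes a one-dimensional strip of pixels in which a 1 marks a finger, and it uses a plain pixel-wise mean as a stand-in for what the model does, which is a deliberate oversimplification of how a diffusion model actually samples. The names pose_a, pose_b, blended, and count_fingers are hypothetical and exist only for this illustration.

```python
# Toy illustration only: a pixel-wise mean stands in for "averaging over
# training images". Real diffusion models do not literally average pixels.

def count_fingers(strip, threshold):
    """Count runs of consecutive pixels brighter than the threshold."""
    runs = 0
    inside = False
    for value in strip:
        bright = value > threshold
        if bright and not inside:
            runs += 1
        inside = bright
    return runs

# Two valid five-finger poses on a 20-pixel strip; the second is the
# first shifted sideways by two pixels.
pose_a = [1, 0, 0, 0] * 5
pose_b = [0, 0, 1, 0] * 5

# Pixel-wise mean of the two poses.
blended = [(a + b) / 2 for a, b in zip(pose_a, pose_b)]

print(count_fingers(pose_a, 0.25))   # fingers in a real pose
print(count_fingers(blended, 0.25))  # faint fingers counted in the blend
print(count_fingers(blended, 0.75))  # only full-strength fingers counted
```

Each input pose has five fingers, but the blend has either ten faint ones or none at all, depending on where the threshold sits. Nothing in the pixel statistics enforces the count that a skeleton would.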

Why Faces Are Easier (Mostly)

The internet is awash with portraits. Selfies, profile pictures, headshots, stock images, social media thumbnails: an AI trained on web data sees human faces in roughly the same arrangement billions of times. Frontal or nearly frontal, symmetrical, with two eyes, one nose, and one mouth in fairly predictable positions.

This repetition is in fact a plus. The model builds a strong prior for how faces look. It becomes very good at making them. Skin texture? Impressive. Eye catchlights? Nailed. The way light wraps around a cheekbone? Consistently beautiful.

But step outside those well-worn coordinates and the cracks appear quickly. Rotate the face to three-quarters and teeth begin to multiply like rabbits. Ask for a dramatic upward angle and the proportions run off a cliff. Ask for an old face with deep lines and the model sometimes flattens them into decorative hints rather than weathered terrain. The model has seen enough unusual faces to know they exist, but not enough to internalize their geometry.

Tools such as free AI profile picture generators handle the frontal portrait relatively well because the training data is so heavily weighted towards that angle. Step sideways from it and you are pushing against the limits of what the model has actually learned.

The Occlusion Problem No One Talks About

Here is a scenario that is guaranteed to make a generative system sweat: two people hugging.

Arms cross and recross. Fingers wrap around shoulders. Part of one person’s hand disappears behind the other person’s coat. Bodies overlap to an extent that would have any 3D artist spending hours planning the mesh. The AI must produce all of this at once, with no internal representation of what is hidden and what is visible, because there is no underlying geometry. There are just pixels.

This is the occlusion problem. Generating a complete hand is hard. Generating a hand that is 40 percent covered by a jacket sleeve and still reads as anatomically plausible is a different order of hard. The model must infer what the hidden fingers are doing without drawing them, and whatever remains visible must look like the natural result. More often than not, it fudges. The outcome is a blur, an extra thumb, or a finger emerging from the wrong side of the wrist.

Low Training Signal, High-Frequency Detail

Zoom in on any AI-generated image and you will notice a pattern: large shapes are handled well, small shapes are handled badly. This isn’t coincidence. It reflects how diffusion models learn.

The model optimizes over a large number of pixels simultaneously. Getting the overall composition right, meaning where the figure sits, how light falls on it, and the rough form of a person’s torso, pays a large dividend in training. Getting the precise configuration of a pinky knuckle right pays a proportionally tiny one. The training signal simply does not emphasize fine anatomical correctness as strongly as it emphasizes overall compositional coherence.

Hands are almost entirely high-frequency detail. Each finger is narrow. Joints are small. The exact position of each fingertip relative to the others carries meaning: it decides whether a hand reads as relaxed, tense, graceful, or broken. The same is true of faces, though to a lesser extent. The difference between a face that reads as beautiful and one that reads as slightly misshapen can come down to a millimeter or two of nostril asymmetry.
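The imbalance is easy to see with back-of-the-envelope arithmetic. The sketch below assumes a plain squared-error loss weighted equally over every pixel, a 512 by 512 image, and a hand occupying a 40 by 40 patch. All of those numbers, and the function name region_loss_share, are illustrative assumptions rather than measurements from any real model, and real systems often compute their loss on noise predictions or in a compressed latent space instead.

```python
def region_loss_share(image_side, region_side, region_error, background_error):
    """Fraction of total squared error contributed by a square region,
    assuming every pixel is weighted equally in the loss."""
    region_pixels = region_side ** 2
    other_pixels = image_side ** 2 - region_pixels
    region_term = region_pixels * region_error ** 2
    other_term = other_pixels * background_error ** 2
    return region_term / (region_term + other_term)

# Same per-pixel error everywhere: the hand's share equals its area share.
equal_share = region_loss_share(512, 40, 0.1, 0.1)

# The hand is three times worse per pixel than the rest of the image.
bad_hand_share = region_loss_share(512, 40, 0.3, 0.1)

print(f"{equal_share:.4f} {bad_hand_share:.4f}")
```

Under these assumptions the hand accounts for roughly 0.6 percent of the loss when errors are uniform, and only about 5 percent even when it is three times worse per pixel than everything else. A badly wrong hand barely registers next to a slightly wrong background.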

The Philosophical Bit (Don’t Skip It)

Something really interesting is lurking in this problem. The most expressive parts of the human body are hands and faces. They are where we show intention, emotion, relationship, and status. They’re dense with meaning.

A sunset doesn’t need to be anatomically correct. A mountain range carries no social information. But a hand gesture can mean a hundred different things depending on minute changes in placement. A face can shift from warm to sinister with the wrong shadow at the wrong angle.

AI image generation is fundamentally learning to copy the parts of human expression that carry the most semantic content, and discovering along the way that these are the parts that are hardest to imitate. Not because the math is impossible, but because the subject matter is, in a very literal sense, the most human thing there is.

There is a kind of philosophical karma in that.

Have you run into especially cursed AI-generated hands lately? The seven-fingered handshake has practically become a genre by now.