The Bake-Off That Beat the Models: An Image-Consistency Eval

A twelve-cell reference board: characters, a cast lineup at relative scale, and full plus detail prop views, generated in one image call. — The v2 reference board. Twelve cells in a single image call, about $0.24. It turned out to be optional.

The hard part of AI video is not the video. It is the stills. You write a scene, attach reference photos of your characters and props, and the first shot looks right. The next one drifts. The cart changes shape, a coat changes color, the set rearranges itself between cuts. Reference images help. They do not lock anything down.

I run a small AI-animated show in my own time, and cross-scene consistency was the thing breaking the illusion. So instead of guessing, I ran a controlled bake-off on one real episode: eight scenes, the same scenes through every lane, judged head to head by independent vision panels with an adversarial pass that tried to refute each finding.

The lanes

Four lanes rendered the same eight scenes. The split was along two axes: which model did the rendering, and what discipline was applied to the prompt and the references.

Lane	Engine	Mechanism	Cost (8 scenes)	Latency
A2	Nano Banana 2	Hygiene fixes only: clean prompts, prop canon as text, cleaned refs	$0.54	11.5s
B2	Nano Banana Pro	Same, plus a 12-cell reference board	$1.07 + $0.24	21.5s
S	Nano Banana Pro	One 4K storyboard sheet, then per-scene refinement	$1.07 + $0.24	18.4s + sheet
C	gpt-image-2	The previous round's winner, the bar to beat	~$1.70 equiv	~3 min

Result

The cheapest lane won the whole thing. Plain Nano Banana 2 at about $0.067 an image, with two fixes that cost nothing, ranked first on all four scored dimensions: prop consistency, character likeness, scene fidelity, and artifacts. It beat the premium board pipeline, the storyboard pipeline, and the previous round's gpt-image-2 on those four. One thing the scorecard flattens: "prop consistency" here means holding a prop's canonical design across shots, not raw shot-to-shot continuity. gpt-image-2 was actually the strongest lane at carrying a layout from one frame to the next. Its loss was on the props and the faces that have to match my canon, not on that continuity.

The two free fixes

State the canon in text. When a prop's canonical detail was written into the scene prompt in plain words, the model rendered it correctly across every shot. When it was left to the reference photo alone, it drifted. One prop's consistency score went from 4 out of 10 to 8 out of 10 on that change alone. Showing the thing is weaker than stating it.

Clean the prompt. The watermark artifacts I had been blaming on the model were coming from my own prompts. The show name was baked into an image-style setting and had ridden along into 98 saved prompts, so the model kept trying to render that text into scenes. One data migration fixed it. Of 146 reference images, only four actually carried a mark.

Nano Banana 2 render of the cabin interior with the correct brass U-yoke. — Nano Banana 2 (winner): correct brass yoke

gpt-image-2 render of the same cabin with a round steering wheel. — Nano Banana 2 (winner): correct brass yoke

The other three lanes

The reference board came second on every dimension and costs roughly 2.5 times the cheap lane. Worth keeping for episodes with large casts, where a lineup cell helps with scale. Skip it otherwise; the prop-canon text does most of the work on its own.

The storyboard sheet came last. The sheet itself looked internally consistent, but any flaw baked into a panel, a stray painterly style, gibberish text, one wrong dog, propagated into every refined frame. As an unattended step it amplifies errors instead of containing them. It only makes sense with a human approval gate: eyeball the $0.24 sheet, re-roll until it is right, then fan out.

Lane	Props	Likeness	Fidelity	Artifacts	Avg
A2 (NB2 + fixes)	8	8	8	7	7.8
B2 (NB Pro + board)	6	8	7	7	7.0
S (storyboard)	5	6	5	6	5.5

For reference, the same pipeline scored 5.5 before the fixes. The 7.8 is the measure of two free changes.

Why the expensive model lost the dogs

gpt-image-2 produced the best composition in the test and the worst character fidelity, and those two facts have one cause. It does not paste your reference pixels into a new layout; it re-synthesizes the whole frame from the conditioning signal. The coarse, structural parts reconstruct faithfully, where the dashboard sits, the brass yoke, the rope, the map, the geometry of the cabin, while the fine-grained, identity-bearing parts, a specific dog's skull shape, muzzle length, and exact markings, are regenerated through the model's learned prior and slide toward its center of mass. Coarse scene structure survives re-synthesis; identity-bearing detail, down to the fine geometry of a specific skull and muzzle, holds only as far as the references pin it down. Where coverage is thin the prior fills in, which is how a single model won prop consistency and lost character likeness in the same bake-off.

It breaks first on the guest cast, which carries the least reference coverage. Esme is a black-and-tan dog with tan eyebrows, a white chest blaze, and a slim tapered muzzle. Given the same porch scene and the same references, the cheap lane kept her on model while gpt-image-2 returned a fawn, flat-faced pug type with a dark mask. Not a stylized Esme but a different individual: the coat inverts from black-dominant to fawn-dominant and the muzzle goes from tapered to brachycephalic, which is breed identity, not rendering style. With thin per-character anchoring there is nothing to hold the line and the prior wins outright.

Canon reference photo of Esme, a black-and-tan dog with tan eyebrows and a white chest blaze. — Canon: Esme

Nano Banana 2 render of the porch scene; the dog at the door is on-model black and tan. — Canon: Esme

The two leads went the other way, and that is the tell. They carry heavy per-character reference coverage, and it holds. I gave gpt-image-2 three escalating tries to lock them: six reference photos plus corrected trait text, then "reproduce this exact frame," then "change only the map, keep the dogs exactly as they are." Across all three the leads stay on model. Identity survives the re-synthesis exactly as far as the references reach, and the guests ride along on a fraction of that coverage, which is why the prior overwrites them and not the leads. So the ceiling is not that identity is unreachable; it is that holding it costs reference coverage you cannot spend on every character in every scene. The heavy way around it is weight-level personalization, a LoRA or trained embedding per character, which does not scale to a rotating guest cast. The one cheap signal that does hold identity is the previous frame itself.

Composition transferred cleanly. Identity held exactly as far as the references reached, and no further. That is structural to a regenerate-from-scratch path, not a prompting gap.

The target frame from Nano Banana 2 with the correct dogs. — Target (Nano Banana 2)

gpt-image-2 attempt one with maximum references. — Target (Nano Banana 2)

The technique worth keeping: chaining

Chaining does not stop the re-synthesis, it re-anchors it. Instead of conditioning on canon photos that sit far from the target, condition each shot on the previous frame, the nearest possible reference: same cabin, same dogs, with the description controlling only what changes. Identity is carried forward rather than re-derived from the prior, so the interior, the props and the dogs hold while the scripted change still lands. One call, about $0.13. The catch is that each frame also inherits the last one's small errors, so identity ratchets over a long sequence; this is the answer for short interior and same-location runs, not for a whole episode generated cold.

Scene nine, the source frame. — Scene 9 (source frame)

Scene ten generated as a continuation of scene nine. — Scene 9 (source frame)

What shipped

Three changes went into the pipeline: prop canon text in every scene prompt, the brand text stripped out of all saved prompts and reference images, and an optional reference board for cast-heavy episodes. New episodes get the winning behavior by default. The regression to watch is the cheap one: keep show branding, channel names and meta-text out of image prompts and style settings. In this test, prompt hygiene mattered more than any model.

Most consistency problems are discipline problems wearing a budget-line costume. Run the eval before you buy the upgrade.

Watch the episode

These frames are from The Doodle Cast, an AI-animated show I run in my own time. Here is the full episode they come from, so the consistency work above has something to point at.

More episodes at thedoodlecast.com and on the YouTube channel.

#ai #image-generation #evaluation #consistency #prompting #gemini #gpt-image #video

The bake-off that beat the models

The lanes

The two free fixes

The other three lanes

Why the expensive model lost the dogs

The technique worth keeping: chaining

What shipped

Watch the episode

1 Comment

Join the discussion

The bake-off that beat the models

The lanes

The two free fixes

The other three lanes

Why the expensive model lost the dogs

The technique worth keeping: chaining

What shipped

Watch the episode

1 Comment

Join the discussion

Related Articles

TheDoodleCast: An AI-Animated Podcast Hosted Entirely by Dogs