A cinematic animated comedy is supposed to be expensive. A writers' room. A voice cast. An animation team. A composer. A sound designer. An editor. The Doodle Cast — a YouTube show where two real dogs argue about the mailman in cinematic detail — is one person and a stack of AI models. The stack is called Showspring, and this is a walk through every AI service in it, the creative job each one does, and what the equivalent crew would look like.
The point is not "AI generated content." The point is AI as the production crew. The operator stays the director.
The whole pipeline on one screen
A single Doodle Cast episode touches roughly twelve distinct AI services across six vendors before it is a published video. Three LLM families handle research, ideation, and scripting: Claude and Gemini do the writing, and Grok does the web search. A local image and video stack handles keyframes and animation. ElevenLabs voices the dogs, generates the music bed, and writes the sound effects. A multimodal planner watches the rendered cut and writes the audience reaction track. After publish, an auto-bible agent reads the final dialogue and proposes new canon back to the show bible. The bible compounds with every episode.
Each station is a named callsite with a routing intent. Cloud APIs are the fallback, not the default. The heavy creative work runs on a single local GPU box via the model vendors' local CLIs, which is what makes the math work for a one-person studio.
The director's chair: human-in-the-loop is sacred
Showspring has seven explicit approval gates. Nothing advances to the next station without an operator click.
- Topics gate — for news segments, the operator picks the topics that survive from the Grok + Gemini research pass.
- Idea gate — when generating an episode pitch, the operator picks the winning pitch from a multi-LLM tournament (more on that below).
- Script gate — the operator can hand-edit any clip, or type a free-text note ("make Oreo sound more cynical here") and a Claude agent will rewrite just that beat.
- Voice gate — every voice take is re-rollable. If a delivery is wrong, the operator can re-render or run the audio through speech-to-speech in the same character voice.
- Locations gate — environment image generation gets per-location approval before scenes are locked.
- Image gate — every keyframe is approvable per clip with a green-glow indicator on the locked ones.
- Mix gate — music, SFX, and the audience reaction track all land as editable rows in the audio mixer before the final render kicks off.
Two steps run unattended after the final approve: auto-publish at the algorithmically-picked time, and the auto-bible canon update. Both have a kill switch.
What this means in practice: AI generates the menu, the operator picks the meal. The interesting design decision is that Showspring is built around the operator's judgment, not around hiding it.
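The gate rule is small enough to sketch in code. What follows is a minimal illustration, not Showspring's implementation: the class and method names are hypothetical, and it assumes the gates clear strictly in the order listed above (an episode that skips a gate, such as a non-news episode with no topics pass, would pass a shorter tuple).

```python
from dataclasses import dataclass, field

# Gate names follow the list above; the strict ordering is an assumption.
GATES = ("topics", "idea", "script", "voice", "locations", "image", "mix")


@dataclass
class EpisodeRun:
    gates: tuple = GATES
    approved: list = field(default_factory=list)

    def next_gate(self):
        """Return the gate waiting on the operator, or None when all are cleared."""
        if len(self.approved) < len(self.gates):
            return self.gates[len(self.approved)]
        return None

    def approve(self, gate):
        """Record an operator click; refuse anything out of order."""
        expected = self.next_gate()
        if gate != expected:
            raise ValueError(f"cannot approve {gate!r}; waiting on {expected!r}")
        self.approved.append(gate)

    def ready_to_render(self):
        """True only after every gate has an operator approval."""
        return self.next_gate() is None
```

The useful property is that the final render cannot start by accident: `ready_to_render()` stays false until the operator has clicked through every gate.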
Agentic workflow #1: the dual-pitch tournament
Most AI tools pick one LLM per task. Showspring runs a tournament. Gemini 3.1 Pro proposes four divergent themes for the episode. For each theme, both Claude Opus and Gemini Pro write a full pitch in parallel. Eight pitches total. Two judges — Claude as the in-house judge, optionally Grok as a third-party second opinion — score every pitch on a rubric. The operator sees all eight pitches and both score sheets, then picks the winner.
This is multi-agent in the simplest sense that matters: multiple models with multiple roles, working in parallel. The diversity is the point. One LLM tends to converge on its own bias; eight pitches across two writers across four themes is a much wider net.
The same shape repeats for narrative segments (the Pack Tour travel format, the Adventures time-travel format). The tournament is one pattern reused across the app.
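As a rough sketch of the tournament shape, assuming each writer and judge is wrapped as a plain Python callable (the function name, the dict layout, and the mean-score ranking are all illustrative assumptions, not the app's actual code):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from statistics import mean


def run_tournament(themes, writers, judges):
    """Every writer pitches every theme in parallel; every judge scores every pitch.

    writers: dict of name -> callable(theme) returning pitch text
    judges:  dict of name -> callable(pitch text) returning a numeric score
    """
    jobs = list(product(themes, writers.items()))
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so texts line up with jobs.
        texts = list(pool.map(lambda job: job[1][1](job[0]), jobs))

    pitches = []
    for (theme, (writer_name, _)), text in zip(jobs, texts):
        scores = {name: judge(text) for name, judge in judges.items()}
        pitches.append({
            "theme": theme,
            "writer": writer_name,
            "text": text,
            "scores": scores,
            "mean_score": mean(scores.values()),
        })
    # Sorted for display only; the operator still picks the winner.
    return sorted(pitches, key=lambda p: p["mean_score"], reverse=True)
```

With four themes and two writers this returns eight scored pitches. The ranking only decides display order, which keeps the final call with the operator.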
Agentic workflow #2: research, script, bible
Once a pitch is picked, the pipeline becomes a hand-off chain. Each AI does its bit, then passes to the next.
For a news-segment episode like the Fire Hydrant Gazette: Grok with web search pulls fresh news from the last few days. Gemini Pro converts the raw research into structured topic cards. The operator approves topics. Then Claude Opus writes the full script with the show bible, the character bible, the YouTube channel's recent retention curves, and the approved topics all loaded into context. The retention curves matter because Showspring stores the audience-retention shape of every previously published episode; when the LLM writes a new script, it sees "your hook drops 12% of viewers in the first 3 seconds — write a tighter opener."
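The retention note is simple arithmetic on a stored curve. A minimal sketch, assuming the curve is kept as (second, fraction still watching) samples and that a 10% early drop is worth flagging; both the storage shape and the threshold are assumptions made for illustration:

```python
def hook_note(curve, window_s=3, threshold=0.10):
    """Turn an audience-retention curve into a one-line note for the script prompt.

    curve: list of (second, fraction_still_watching), sorted by time, starting at 0.
    Returns None when the early drop is below the threshold.
    """
    start = curve[0][1]
    # Last sample at or before the end of the window.
    at_window = [frac for sec, frac in curve if sec <= window_s][-1]
    drop = start - at_window
    if drop < threshold:
        return None
    return (f"Your hook drops {drop:.0%} of viewers in the first "
            f"{window_s} seconds. Write a tighter opener.")
```

The returned string would be appended to the script writer's context alongside the bibles and the approved topics.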
After the episode is rendered and published, the loop closes. A separate Claude Opus agent reads the final dialogue and proposes one to three short bullets to add back to the canonical show bible. Voice tics that landed well. Recurring bits that emerged. Character traits that crystallized. The bullets land as a review queue in the operator UI; one click adds them to canon.
Future scripts inherit the accumulated lore. The bible compounds. This is the part of Showspring that no traditional production stack does. There is no dedicated continuity editor, yet the canon doesn't drift, because the AI reads every episode and the operator reviews every diff.
Agentic workflow #3: the AI that watches the cut
This one is the most unusual integration in the stack. Every other production AI works from the script. Showspring has a multimodal planner that watches the final rendered video — keyframes, audio, transcript — and writes a JSON plan for the audience reaction layer.
It says things like: at 04:12 there's a beat after Oreo's delivery that should hold for half a second longer than the script implies, drop a chuckle there. At 08:31 the wide shot of Rusty looking up needs a small swell of music underneath, not a stinger. At 14:09 the gavel hit lands but the camera shakes too late for the audience to feel it; tighten the SFX timing by 2 frames.
The AI is consuming the same artifact the audience will. It can spot a beat that landed differently than the script implied. The output is a structured plan that the audio engineer (also software) executes.
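The article doesn't show the plan format, so the following is a guess at what a structured plan could look like and how the executing side might validate it before touching the mix. The field names, the layer names, and the `parse_plan` helper are all hypothetical:

```python
import json
from dataclasses import dataclass

ALLOWED_LAYERS = {"audience", "music", "sfx"}  # layer names are assumptions


@dataclass(frozen=True)
class Cue:
    at_s: float      # position in the rendered cut, in seconds
    layer: str       # which mixer row the cue belongs to
    action: str      # e.g. "chuckle", "swell", "shift"
    note: str = ""   # the planner's reasoning, kept for the operator


def parse_plan(raw, duration_s):
    """Validate a planner response and return its cues sorted by time."""
    cues = [Cue(**item) for item in json.loads(raw)["cues"]]
    for cue in cues:
        if cue.layer not in ALLOWED_LAYERS:
            raise ValueError(f"unknown layer: {cue.layer}")
        if not 0 <= cue.at_s <= duration_s:
            raise ValueError(f"cue at {cue.at_s}s is outside the cut")
    return sorted(cues, key=lambda c: c.at_s)
```

Validating before executing matters here because the plan comes from a model: a cue past the end of the video or aimed at a nonexistent mixer row should fail loudly rather than silently shift the mix.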
Without this, an indie show either ships without a laugh track or builds one by hand against a click track, which costs an editor hours per episode.
Voice acting: voices, not just synthesis
The Doodle Cast cast is locked. Rusty has the same deadpan baritone in every episode. Oreo has the same manic delivery. Bogart has the same Burt-Reynolds-in-retirement growl. These are ElevenLabs voice IDs assigned per character. Generating dialogue means matching the character to the voice ID and rendering.
The unusual part is voice replacement. When a video-generation model produces a clip with its own auto-generated audio (Veo does this), the audio is rarely the character voice you want. Showspring feeds the video's audio back through ElevenLabs speech-to-speech in the target character voice. The result is one finished video file with the right voice, not a separate dialogue track laid over a muted clip. Lip-sync survives. Room tone survives. The voice changes.
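One plausible way to bracket that step with ffmpeg is sketched below, assuming the clip's audio is extracted to a WAV file, converted by the speech-to-speech service, and muxed back without re-encoding the picture. The function name and file names are hypothetical, and the speech-to-speech call itself is deliberately left out rather than guessed at:

```python
def revoice_plan(clip, voiced_wav, out, scratch_wav="source_audio.wav"):
    """Return the two ffmpeg commands that bracket the speech-to-speech step.

    1. extract: pull the clip's own audio into scratch_wav.
    2. (not shown) convert scratch_wav into voiced_wav in the character voice.
    3. remux: pair the untouched video stream with the converted audio.
    """
    extract = ["ffmpeg", "-y", "-i", clip, "-vn",
               "-acodec", "pcm_s16le", scratch_wav]
    remux = ["ffmpeg", "-y", "-i", clip, "-i", voiced_wav,
             "-map", "0:v:0", "-map", "1:a:0",
             "-c:v", "copy", "-shortest", out]
    return extract, remux
```

Each command can be run with `subprocess.run(cmd, check=True)`. Copying the video stream (`-c:v copy`) means the picture is never re-encoded, so timing against the converted audio holds as long as the conversion preserves the original pacing.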
For guest dogs submitted by fans, the intake itself is agentic. A Claude pass reads the text intro. A Gemini pass reads the submitted photos to extract appearance. Another Claude pass smooths everything into bible-canon prose. A final Claude pass writes a 1-2 minute self-introduction script with inline audio direction tags. A Gemini pass picks an ElevenLabs voice that matches the personality. Five LLM passes across two modalities, and the guest is camera-ready.
Traditional production: cast 5+ voice actors. A booth. A direction session. Multiple takes. Pickups. Hundreds to thousands of dollars per character per session, with recurring costs every time a character appears.
Image generation: five engines, one visual brand
Showspring doesn't lock to one image model. The operator picks per scene from five engines.
The production default is Gemini 3.1 Flash Image because it handles multi-character reference photos. Every character has at least three reference images of the real dog. When the LLM-written scene description says "Rusty and Oreo at the broadcast desk," the image generator gets the prompt plus the three Rusty photos plus the three Oreo photos plus per-character "this one is left, this one is right" metadata. Rusty looks like Rusty across nineteen episodes because the same reference photos anchor every frame.
The other engines (SDXL, Qwen-Image-Edit, Z-Image-Turbo, LTX) are local on a single GPU box and are used when the operator wants stylistic variety or when the cloud quota is tight. Qwen-Image-Edit is especially good at editing existing keyframes — "make Oreo's expression more skeptical" while everything else stays the same.
An anti-slop instruction is appended to every Gemini image prompt: a list of "AI render tells to avoid" — perfect symmetry, glossy plastic surfaces, drone-style impossible camera angles, hyper-smooth fur. YouTube's algorithm penalizes detected AI slop. The instruction nudges the model away from the most-flagged patterns.
Motion verbs are stripped from the scene description before it hits image gen, because the still is the first frame of an image-to-video pipeline. "Rusty turns to look" produces a turned head that has nowhere to go in the video step. "Rusty looking" produces a frame with motion still ahead of it.
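Put together, the prompt assembly for one keyframe might look roughly like this. It is a sketch under stated assumptions: the rewrite table is a tiny illustrative stand-in for whatever Showspring actually uses to strip motion verbs, and the function name and request shape are hypothetical.

```python
import re

# The "tells to avoid" come from the list above.
AVOID = [
    "perfect symmetry",
    "glossy plastic surfaces",
    "drone-style impossible camera angles",
    "hyper-smooth fur",
]

# Illustrative only: a real pipeline would need far broader coverage.
MOTION_REWRITES = {
    r"\bturns to look\b": "looking",
    r"\bwalks over to\b": "standing near",
}


def build_image_request(scene, cast):
    """Assemble a still-frame prompt plus per-character reference photos.

    cast: list of (name, position, [reference photo paths])
    """
    still = scene
    for pattern, replacement in MOTION_REWRITES.items():
        still = re.sub(pattern, replacement, still, flags=re.IGNORECASE)
    prompt = f"{still}\nAI render tells to avoid: {', '.join(AVOID)}."
    references = [
        {"character": name, "position": position, "path": path}
        for name, position, paths in cast
        for path in paths
    ]
    return {"prompt": prompt, "references": references}
```

For a two-dog scene with three photos each, the request carries six reference entries, each tagged with the character and which side of the frame that character belongs on.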
Traditional production: a storyboard artist. Hours of layout per scene. Multiple revisions.
Video: mix-and-match models
Three video engines, picked per shot. Google Veo for clips that need natural audio and lip-sync. Higgsfield Seedance for stylized motion and the Soul ID character lock that keeps Mimi's bat ears looking like Mimi's bat ears across shots. A local ComfyUI rig running WAN 2.2 for batch volume and test-lab experiments.
Why mix: each model has a different look and a different price point. A nineteen-episode season gets visual variety without locking to one vendor's aesthetic, and the operator can balance budget against quality per shot.
Traditional production: a 2D or 3D animation team. Days of work per finished minute. The bottleneck of the whole show.
Music, SFX, and the audience track
ElevenLabs sound generation writes music beds and SFX from text prompts. The text prompts themselves are written by Claude or Gemini ("Sparse Court TV tension cue, single sustained low-brass chord, no rhythm, 8 seconds, ducks 4 dB under dialogue"). The operator edits the description, the model regenerates the audio. This is a tighter feedback loop than working with a composer; the cost is that the operator becomes the music director.
For mixed shows, audience reactions (laughs, applause, gasps, groans) are generated by ElevenLabs and placed by the multimodal planner described earlier. Loudness is normalized per type: music to about -23 LUFS, SFX to about -18 LUFS, and audience reactions to about -20 LUFS, so the mix stays consistent across episodes.
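The normalization itself is a subtraction in decibels. A minimal sketch using the targets quoted above, assuming the integrated loudness of each stem has already been measured by some other tool; the function names are illustrative:

```python
# Targets from the paragraph above, in LUFS.
TARGET_LUFS = {"music": -23.0, "sfx": -18.0, "audience": -20.0}


def gain_db(kind, measured_lufs):
    """Gain in dB that moves a stem from its measured loudness to its target."""
    return TARGET_LUFS[kind] - measured_lufs


def linear_gain(db):
    """Convert a dB gain into the amplitude multiplier applied to samples."""
    return 10 ** (db / 20)
```

A music bed measured at -17 LUFS needs -6 dB, which is roughly half the amplitude. Because the gain is derived from each stem's own measurement, a loud generation and a quiet one land at the same level in the mixer.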
A sidechain ducker runs at preview time and again at render time, so dialogue always sits cleanly above the music bed. ffmpeg's sidechaincompress filter does the final pass.
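A ducking pass with `sidechaincompress` can be sketched as a command builder. The filter and its `threshold`, `ratio`, `attack`, and `release` options are real ffmpeg features; the specific values, the function name, and the choice to finish with `amix` are assumptions made for illustration.

```python
def ducking_command(music, dialogue, out,
                    threshold=0.05, ratio=8, attack=20, release=300):
    """Build an ffmpeg command that ducks the music bed under the dialogue.

    The dialogue is split in two: one copy keys the compressor,
    the other is mixed back in over the ducked music.
    """
    graph = (
        "[1:a]asplit=2[key][voice];"
        f"[0:a][key]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack}:release={release}[ducked];"
        "[ducked][voice]amix=inputs=2:duration=longest[mix]"
    )
    return ["ffmpeg", "-y", "-i", music, "-i", dialogue,
            "-filter_complex", graph, "-map", "[mix]", out]
```

Attack and release are in milliseconds. A fast attack pulls the music down as soon as a line starts, and a slower release lets it swell back between lines instead of pumping. Note that `amix` rescales its inputs by default, so the overall level may need a follow-up adjustment.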
Traditional production: a composer, a sound designer, a mixer. Three roles, easily five figures per episode.
How the bible writes itself
The auto-bible accumulator deserves its own beat because it's the integration that makes Showspring different from a "scripts and voices" tool.
After every published episode, a Claude Opus agent reads the final dialogue end-to-end and proposes additions to the canonical show bible — the markdown document that every script LLM loads as primary context. The proposals are not "rewrite the bible." They are specific bullets pointed at specific sections: a new catchphrase that landed, a character habit that emerged ("Oreo licks the magic man's glove"), a recurring location that needs a paragraph in the setting list, a comedic mechanic the LLM noticed working ("Rusty's reaction to Oreo's interruptions is funnier than the interruption itself").
The operator reviews the proposals in a queue. One click adds them. Future scripts pull the new bullets in automatically.
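The "one click" can be pictured as appending a bullet to one section of the markdown file. This sketch assumes sections are marked with `## ` headings and that a proposal names its target section; neither detail is confirmed by the description above, and the function name is hypothetical.

```python
def apply_proposal(bible_md, section, bullet):
    """Append a bullet to the end of the named section of a markdown bible."""
    lines = bible_md.splitlines()
    heading = f"## {section}"
    if heading not in lines:
        raise KeyError(section)
    start = lines.index(heading) + 1
    # The section ends at the next heading of the same level, or at end of file.
    end = next(
        (i for i in range(start, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    # Skip trailing blank lines so the bullet joins the existing list.
    while end > start and not lines[end - 1].strip():
        end -= 1
    lines.insert(end, f"- {bullet}")
    return "\n".join(lines) + "\n"
```

Because each proposal is an append to a named section rather than a rewrite, the operator reviews a small diff, and a rejected proposal leaves the bible untouched.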
Traditional production: a continuity editor or a showrunner who maintains a story bible by hand. Every episode is a chance for drift; every drift kills a callback that would have landed.
The local-vs-cloud routing
This is where the math works.
Every LLM call in Showspring has a named callsite — one for the idea agent, one for the script writer, one for music descriptions, one for the strategy report, and so on. Each callsite has a declared routing intent: which model class it wants, with cloud as the fallback rather than the default.
A single GPU box runs Claude Opus 4 via its local CLI, plus Gemini 3.1 Pro via its local CLI, plus a small army of open-source image and video models. When a callsite says "Claude Opus," the router calls the local CLI first and only falls through to the cloud API if the local path is unreachable.
When the local Claude CLI hits its rolling rate window, the router auto-swaps every Claude callsite over to local Gemini Pro for the next few hours. By the time Claude reopens, the work for the day is done.
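The routing rule described above can be sketched as follows. Everything named here is hypothetical: the `Route` fields, the `RateLimited` exception, the three-hour cooldown standing in for "the next few hours", and the idea that each backend is a plain callable wrapping a CLI or API.

```python
import time
from dataclasses import dataclass


class RateLimited(Exception):
    """Raised by a backend when its rolling rate window is exhausted."""


@dataclass(frozen=True)
class Route:
    local: str   # preferred backend
    swap: str    # used while the preferred backend is rate limited
    cloud: str   # used only when the local path is unreachable


class Router:
    def __init__(self, backends, routes, cooldown_s=3 * 3600, clock=time.monotonic):
        self.backends = backends   # backend name -> callable(prompt) returning text
        self.routes = routes       # callsite name -> Route
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.blocked_until = {}    # backend name -> time it reopens

    def call(self, callsite, prompt):
        """Return (backend name used, response text) for one callsite."""
        route = self.routes[callsite]
        if self.clock() < self.blocked_until.get(route.local, 0.0):
            return route.swap, self.backends[route.swap](prompt)
        try:
            return route.local, self.backends[route.local](prompt)
        except RateLimited:
            # The block is per backend, so every callsite that prefers it swaps too.
            self.blocked_until[route.local] = self.clock() + self.cooldown_s
            return route.swap, self.backends[route.swap](prompt)
        except ConnectionError:
            return route.cloud, self.backends[route.cloud](prompt)
```

Keying the cooldown on the backend rather than the callsite is what produces the "every Claude callsite swaps at once" behavior: the first callsite to hit the rate window flips the rest without each one having to fail first.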
The unlock: a heavy creative pass runs against a flat-rate local bucket, not a metered API charge. Without this, the same workflow at API rates would be untenable for a solo creator.
What's unique here
A few patterns that haven't shown up in other AI production tools:
- The tournament with two judges. Most tools pick one model. Showspring runs eight candidates and two evaluators in parallel because diversity is the point.
- Voice replacement on the video, not over it. The video's own audio gets re-voiced in the target character — lip-sync, room tone, everything survives.
- The AI that watches the cut. A multimodal planner consumes the rendered video plus keyframes and writes a beat-aware reaction plan. Not just script-aware.
- The show bible compounds. A post-publish agent proposes canon back to the bible after every episode. The lore grows with every release.
- Anti-slop baked in. Every image prompt appends a "tells to avoid" list that pushes the model away from the patterns YouTube's algorithm penalizes.
- Per-callsite routing. Cloud is the fallback. Heavy creative work runs through the model vendors' local CLIs on a local GPU box.
Without any of this
A cinematic animated comedy episode in a traditional studio runs roughly ten people through three to six weeks of work. Writers' room, continuity editor, voice cast, animation team, composer, sound designer, editor, social manager. Five-figure budgets are the low end for indie animated content.
The Doodle Cast is one person and the stack above. Episodes ship every few days. Nineteen long-form episodes in five months, plus the shorts pipeline humming next to it.
The unlock is not that the AI is creative. The unlock is that the AI is the crew. The director's chair is still human. And the seven approval gates are what keep it that way.
If you're a creator with a strong director's instinct and no studio, this is the shape the work takes now. Showspring is one expression of it. The pattern travels.
Showspring is in private beta. If you're building something similar or want to see what one person can ship with this stack, reach out.