Create Studio: Hybrid Cloud-GPU Architecture for AI Video Generation

Editor’s note: Create Studio was an early exploration of one-person AI ad pipelines, and this write-up documents the architecture as it was built (the model names reflect that era). The production-grade successor of these ideas is Showspring, the AI video production studio behind The Doodle Cast.

1. System Overview

Create Studio is an AI-powered video ad generator that allows users to upload photos of businesses, arrange them on an NLE-style timeline, add AI-generated voiceover and music, and render broadcast-quality video ads — all from a browser.

The system runs on a hybrid architecture spanning two physical locations: a DigitalOcean VPS in the cloud and a local workstation with an NVIDIA RTX 5090 GPU. The architecture is designed so that every feature works when the GPU PC is offline, with the 5090 providing enhanced performance when available.

Fig 1 — High-Level Architecture

CLOUD VPS LOCAL GPU HOST Express Server Node.js Apache Proxy SSL + Reverse Proxy FFmpeg VPS Fallback Render Google Drive OAuth2 Storage ElevenLabs API TTS + Music Cloud Google Veo 3 AI Video (Cloud) Gemini API Script Gen Fallback Jobs & Queue data/jobs.json Flask API Python Cloudflare Tunnel images.thedoodlecast FFmpeg 5090 GPU Render SDXL Image Generation ACE-Step 1.5 Music Generation MusicGen Music Fallback Wan2.1 14B Video Gen (Local) Ollama Script Generation HTTPS HTTP Upload

Key Design Principles

Graceful degradation — Every feature works without the 5090. Cloud APIs (ElevenLabs, Veo 3, Gemini) provide fallbacks.
Transparent switching — The is5090Online() health check (30s cache) automatically routes to the best available backend.
Zero downtime transitions — The 5090 can go offline mid-session (e.g., user starts sim racing) without breaking in-progress renders.

2. Request Flow & Routing

Every request from the browser follows this path:

Fig 2 — Request Flow

Browser User HTTPS Apache :443 SSL proxy Express Node.js 5090 online? 5090 offline? 5090 Flask GPU Processing Cloud APIs ElevenLabs/Veo/Gemini Result

Session & Authentication

Multi-user auth with session cookies. POST /auth validates credentials, stores req.session.user. All API routes check requireAuth middleware. Gallery items are tagged per-user for filtering.

// server.js — Auth flow
const USERS = { rick: 'rick', jens: 'jens' };

app.post('/auth', (req, res) => {
  const { username, password } = req.body;
  if (USERS[username] === password) {
    req.session.user = username;
    res.json({ success: true, user: username });
  }
});

3. GPU Health Check & Service Discovery

The VPS continuously monitors the 5090's availability through a cached health check. This is the foundation of the hybrid architecture — every subsystem queries this before deciding where to route work.

Fig 3 — Health Check State Machine

ONLINE gpu5090Online=true OFFLINE gpu5090Online=false /health timeout or error /health returns 200 OK 30s cache (GPU_CHECK)

// Cached 5090 health check
let gpu5090Online = false;
let gpu5090LastCheck = 0;
const GPU_CHECK_INTERVAL = 30000; // 30 seconds

async function is5090Online() {
  if (Date.now() - gpu5090LastCheck < GPU_CHECK_INTERVAL)
    return gpu5090Online;
  try {
    const r = await fetch(`${IMAGE_API_URL}/health`, {
      signal: AbortSignal.timeout(5000)
    });
    gpu5090Online = r.ok;
  } catch {
    gpu5090Online = false;
  }
  gpu5090LastCheck = Date.now();
  return gpu5090Online;
}

The 30-second cache prevents hammering the health endpoint during burst operations while keeping the state reasonably fresh. The 5-second timeout prevents the check itself from blocking render pipelines.

4. Video Rendering Pipeline

The render pipeline is the most complex subsystem. It builds an FFmpeg filter_complex graph from the timeline state, then decides where to execute it based on 5090 availability.

Fig 4 — Render Pipeline Decision Tree

Generate Video POST /api/generate Validate Clips 10 clips max TTS Voiceover ElevenLabs API Generate Music Hybrid Dispatcher AI Clips (if any) Veo 3 / Wan2.1 5090 Online? 5090 FFmpeg HTTP offload + upload YES VPS FFmpeg Local render NO fallback if 5090 render fails Upload to Drive Google Drive API

4.1 FFmpeg Offload to 5090

When the 5090 is online, the VPS offloads FFmpeg rendering for dramatically faster encode times (GPU-accelerated vs. VPS CPU). The process works via a fully HTTP-based transfer protocol — no SSH/SCP required:

Fig 5 — FFmpeg Offload Sequence

VPS (Express) Cloudflare Tunnel 5090 (Flask) 1. Build filter_complex 2. Create render token + file download URLs 3. POST /render-video (6KB JSON) 4. GET /api/render-file/{token}/{idx} (x12) 5. Run FFmpeg libx264, drawtext, xfade 6. GET /render-status/{jobId} (poll 3s) { status: "rendering", progress: 45 } 7. PUT /api/render-upload/{token} (video bytes) 8. { status: "done", progress: 100 } 9. Upload to Google Drive

Font Path Mapping

The VPS builds filter_complex strings with Linux font paths. Before sending to the 5090, these are mapped to Windows equivalents with escaped colons (since : is a delimiter in FFmpeg filter syntax):

const WIN_FONT_MAP = {
  '/usr/share/fonts/.../OpenSans-Bold.ttf':
    'C\:/Windows/Fonts/arialbd.ttf',
  '/usr/share/fonts/.../DejaVuSerif.ttf':
    'C\:/Windows/Fonts/times.ttf',
  // ... 12 mappings total
};

5. Hybrid Music Generation

Music generation uses a dispatcher pattern that routes to the best available engine based on user preference and 5090 availability.

Fig 6 — Music Engine Routing

Music Request prompt + duration Engine Select ACE-Step 1.5 5090 via Flask API local ElevenLabs Cloud SFX API elevenlabs Auto Mode Logic 5090 online? → ACE-Step 5090 offline? → ElevenLabs local + offline? → auto-fallback auto WAV MP3

ACE-Step 1.5 (Local)

Free, unlimited generations
Better quality for longer tracks
Lazy-loaded on first request (~60s)
Retry loop: 7 attempts, 15s intervals
Output: WAV, variable length

ElevenLabs (Cloud)

Pay-per-generation
Always available, no warmup
Max ~22 seconds per generation
API: POST /v1/sound-generation
Output: MP3, fixed duration

6. AI Video Clip Generation

Create Studio supports three clip modes: Ken Burns (zoom/pan on static images), AI Image-to-Video (I2V), and AI Text-to-Video (T2V). The AI modes use different backends based on availability:

Fig 7 — AI Video Generation Matrix

Model	Location	I2V	T2V	Duration	Availability
Google Veo 3	Cloud (Google API)	✓	✓	4-8 seconds	Always
Wan2.1 14B	Local (5090 GPU)	✓	✓ (via SDXL)	~5 seconds	5090 only

The UI dynamically shows/hides the Wan2.1 option based on 5090 status. When offline, clips default to Veo 3 and the "Advanced" toggle is hidden. The VPS can call Veo 3 directly via the Google Generative AI REST API without needing the 5090.

7. LLM Script Generation

Voiceover scripts and ad copy are generated via LLM with a two-tier fallback:

Fig 8 — LLM Fallback Chain

callLLM() prompt + timeout try first Ollama 5090 / llama3.2 fail/timeout Gemini Flash Cloud API Ollama: 8s timeout (AbortController) Gemini: 15s timeout, model: gemini-2.0-flash

async function callLLM(prompt, timeoutMs = 15000) {
  // Try Ollama first (local, free, fast when available)
  try {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), 8000);
    const res = await fetch(OLLAMA_URL, { ... });
    clearTimeout(timer);
    if (res.ok) return (await res.json()).response?.trim();
  } catch {}
  // Fallback: Gemini API (always available)
  try {
    const res = await fetch(
      `https://generativelanguage.googleapis.com/v1beta/
       models/gemini-2.0-flash:generateContent?key=${KEY}`,
      { method: 'POST', body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }) }
    );
    if (res.ok) return data?.candidates?.[0]?.content?.parts?.[0]?.text?.trim();
  } catch {}
  return null;
}

8. Job Lifecycle & Data Model

Fig 9 — Job State Machine

pending processing done Google Drive upload + cleanup error paused cancel cancelled

// Job data shape (data/jobs.json)
{
  id: "uuid",
  user: "jens",
  brandName: "Joe's Pizza",
  status: "processing",    // pending | processing | done | error | paused
  step: "Rendering on 5090 GPU PC...",
  progress: 67,            // 0-100
  outputUrl: "/videos/uuid.mp4",
  error: null,
  createdAt: "2026-03-13T...",
  completedAt: null,
  comment: "",
  rating: 0,
  driveFileId: "1abc...",
  driveFileSize: 45000000,
  warnings: []             // AI clip fallback warnings
}

9. NLE Timeline Editor

The editor UI follows a traditional non-linear editing layout with a media library panel and a multi-track timeline:

Fig 10 — NLE Editor Layout

Media Library 250px fixed Upload My Library persistent per-user data/user-photos.json Project Photos Google Drive backed AI Scenes T2V placeholders ← drag to timeline → MIME: application/ x-media-photo Timeline fluid width VIDEO VOICE MUSIC Clip 1 (4s) AI (5s) Clip 3 (4s) Clip 4 T2V (6s) Voiceover — draggable + resizable Background Music — full timeline width 0s 10s 20s 30s 40s

10. Deployment & Infrastructure

VPS (DigitalOcean)

OS	Ubuntu 24.04 LTS
Runtime	Node.js + PM2 (cluster)
Proxy	Apache 2.4 + SSL
Port	internal
Domain	create.overdigital.ai
Storage	Google Drive (OAuth2)
Deploy	SCP + PM2 restart

5090 GPU Workstation

OS	Windows 11
GPU	NVIDIA RTX 5090 (32GB)
Runtime	Python Flask + CUDA
Port	internal
Tunnel	Cloudflare (images.*)
Startup	Windows Scheduled Task
FFmpeg	Shared build in PATH

File Transfer Protocol

All file transfers between VPS and 5090 use HTTP — no SCP/SSH required from the 5090 side. This avoids issues with SSH key access in Windows Scheduled Task contexts:

Fig 11 — File Transfer Architecture

VPS Express Token-based file server 5090 Flask urllib downloads GET /api/render-file/{token}/{idx} 5090 pulls input files from VPS PUT /api/render-upload/{token} 5090 pushes rendered video to VPS Render Token (crypto.randomBytes) TTL: 10 minutes | Maps idx → file path No auth required (token IS the auth)

11. Technology Stack

Frontend

Vanilla HTML/CSS/JS (single file)
~7000 lines in index.html
CSS custom properties for theming
HTML5 Drag & Drop API
Web Audio API for previews

Backend (VPS)

Node.js + Express
PM2 (cluster mode)
express-session
Google APIs (Drive, Veo 3)
FFmpeg (fallback render)

GPU Backend (5090)

Python Flask
PyTorch + CUDA
diffusers 0.37.0
ACE-Step, SDXL, Wan2.1
FFmpeg (primary render)

External APIs

ElevenLabs (TTS + SFX)
Google Veo 3 (video gen)
Google Gemini (LLM fallback)
Google Drive (storage)
Google Places (photo import)

12. Lessons Learned

SCP doesn't work reliably from Windows Scheduled Tasks — SSH key access differs between interactive and service contexts. HTTP-based file transfer solved this permanently.
FFmpeg filter_complex and Windows paths don't mix — The : in C: is a filter delimiter. Escaping with \: is required.
Base64 in JSON has ~33% overhead — Sending 10 photos as base64 turned a 30MB payload into 40MB+. Token-based download URLs reduced the submission payload to 6KB.
Cloudflare blocks Python urllib — The default Python-urllib/3.x user agent triggers Cloudflare's bot detection. A custom User-Agent header fixed 403 errors.
Design for offline-first — Making every feature work without the GPU from the start would have saved significant refactoring. The hybrid fallback pattern (try local → fallback to cloud) should be the default architecture, not an afterthought.

This is a private architecture document for Create Studio. Built by Jens Loeffler. Last updated March 2026.

#architecture #ai-video #gpu #distributed-systems #ffmpeg

Create Studio: Hybrid Cloud-GPU Architecture for AI Video Generation

1. System Overview

Key Design Principles

2. Request Flow & Routing

Session & Authentication

3. GPU Health Check & Service Discovery

4. Video Rendering Pipeline

4.1 FFmpeg Offload to 5090

Font Path Mapping

5. Hybrid Music Generation

6. AI Video Clip Generation

7. LLM Script Generation

8. Job Lifecycle & Data Model

9. NLE Timeline Editor

10. Deployment & Infrastructure

File Transfer Protocol

11. Technology Stack

12. Lessons Learned

1 Comment

Join the discussion

Create Studio: Hybrid Cloud-GPU Architecture for AI Video Generation

1. System Overview

Key Design Principles

2. Request Flow & Routing

Session & Authentication

3. GPU Health Check & Service Discovery

4. Video Rendering Pipeline

4.1 FFmpeg Offload to 5090

Font Path Mapping

5. Hybrid Music Generation

6. AI Video Clip Generation

7. LLM Script Generation

8. Job Lifecycle & Data Model

9. NLE Timeline Editor

10. Deployment & Infrastructure

File Transfer Protocol

11. Technology Stack

12. Lessons Learned

1 Comment

Join the discussion

Related Articles

Why I Traded One AI Agent for a Fleet, and What It Did to My Video

AI Velocity Is a Control Problem

The AI Production Crew: How Showspring Built The Doodle Cast Without a Studio