The content automation system that ships 1 billion views per month
A field report on the actual architecture behind a billion-view-per-month AI content pipeline — topic generation, Nano Banana Pro and Flux for stills, Kling 3 for image-to-video, Remotion and CapCut for assembly, LLM-as-judge for slop rejection, and the distribution layer that doesn't get accounts banned.

The system in this post pushed past one billion monthly views in late March. The number is a lagging measure of a pipeline we have been running and modifying for roughly fourteen months — through three model generations, two deeply painful platform-policy shifts, and one cohort of accounts that all got banned in a single weekend because we'd shipped a templated caption ending in the same emoji. The pipeline runs across Instagram Reels, TikTok, and YouTube Shorts, with a much smaller spillover into LinkedIn and X. It was built initially for a single fitness-transformation client whose images we'd been generating with Nano Banana Pro, then generalized into a six-vertical content engine that now operates as a separate book of business inside the studio. The honest version of this case study is that the AI tools did roughly half the work, the verification step did roughly a quarter, the distribution-layer engineering did the remaining quarter, and zero of it would have shipped without an operator who understood that "fully autonomous" is a marketing claim and "humans at three specific checkpoints" is how you actually run it.
The shape of the pipeline
The whole system is five stages plus a feedback loop. Each stage runs as an independent service with its own queue, its own retry behavior, and its own cost meter. Failures at any stage drop the asset, not the batch.
Stage 1: topic generation. A Claude Code job runs once a day per vertical and produces a list of 80–120 candidate topics. The job pulls from three sources: a curated subreddit list scoped to the vertical, the previous week's top-performing assets across all our accounts in the same vertical, and a small cache of "evergreen frames" we've manually written. The Claude prompt is unusually long — about 1,400 tokens — because it has to reject topics that overlap too closely with what we shipped in the last 14 days. Each surviving topic comes out with a one-line hook, a target visual concept, and a target duration in seconds.
Stage 2: visual generation. Each surviving topic gets a still image (or batch of stills, for sequence-style shorts) generated by either Nano Banana Pro or Flux Pro. The model choice is per-vertical and per-asset-type — Nano Banana Pro for anything that needs text in the frame, Flux for the photoreal verticals, Midjourney for stylized work where we have time for the slower API. The stills queue runs at maximum concurrency the providers will let us push without rate-limiting; in practice that's about 240 stills per hour aggregated across providers.
Stage 3: image-to-video. The stills feed Kling 3 for image-to-video, with each clip generated at 5 seconds of motion at 1080p. We default to Kling's "Standard" tier on cost grounds and only escalate to "Pro" tier when an asset is destined for a high-CPM vertical or has been flagged by the LLM judge as needing higher motion fidelity on a re-render. The full image-to-video pass is the slowest stage — single clips take 90–180 seconds even on Standard tier — so the pipeline is heavily parallelized, with each vertical getting its own Kling worker pool.
Stage 4: assembly. Two paths: the Remotion path for shorts where we need exact frame control (anything with on-screen text overlays, graph reveals, before/after pans) and the CapCut path for higher-velocity templated assembly (most lifestyle and educational verticals). The Remotion side runs on a small dedicated machine with hardware video encoding; the CapCut side runs through CapCut's API on a per-template basis. Both produce 9:16 1080x1920 MP4 outputs.
Stage 5: verification. Every assembled short runs through an LLM-judge pass before it gets queued for distribution. The judge — a Claude prompt with a fixed rubric — answers six questions per clip:

- Is the hook visible in the first 1.2 seconds?
- Is there any visual artifact in the first frame?
- Does the caption match the visual content?
- Is the audio level within the -9 to -3 dBFS window?
- Does the topic overlap too heavily with another asset shipped in the last 7 days?
- Is there any AI-generated artifact (hands, text mangling, identity drift on a recurring character) that would tip off a viewer?

Anything that fails two or more checks goes to a human reviewer; anything that fails the artifact check alone is auto-rejected and the asset is killed.
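The routing behind the gate is simple enough to sketch. This is an illustrative reconstruction, not the production code: the check names are ours, and the six booleans would come from the LLM judge's structured output.

```python
# Illustrative sketch of the verification gate's routing: six pass/fail
# answers per clip, artifact failure is fatal on its own, and any two
# failures route to the human queue. Check names are illustrative.

AUTO_REJECT_CHECKS = {"no_ai_artifact"}  # failing this alone kills the asset

def route_asset(results: dict[str, bool]) -> str:
    """Map judge answers (check name -> passed?) to a routing decision."""
    failed = {name for name, passed in results.items() if not passed}
    if failed & AUTO_REJECT_CHECKS:
        return "rejected"        # artifact failure is auto-rejected
    if len(failed) >= 2:
        return "human_review"    # two or more failures -> human reviewer
    return "approved"

clip = {
    "hook_in_first_1_2s": True,
    "first_frame_clean": True,
    "caption_matches": True,
    "audio_level_ok": False,     # a single non-artifact failure still ships
    "no_recent_overlap": True,
    "no_ai_artifact": True,
}
print(route_asset(clip))  # -> approved
```

The asymmetry is the point: one marginal audio level is survivable, one visible AI artifact is not.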
Distribution loop. Approved shorts land in a per-account scheduling queue. Each account has a posting cadence, a per-platform variant chain (the same short with different captions, hashtags, audio overlays per platform), and a kill switch tied to its account-health monitor. Performance data — views, watch-through, saves, follows-from-this-post — gets piped back to the Stage-1 topic generator on a 6-hour delay, which closes the feedback loop.
The operator-level details that don't appear in the architecture diagram
The diagram is the easy part. The reasons this pipeline is expensive to build but cheap to run live below the diagram, in the seams between stages.
The first detail is that each stage runs idempotently. If Kling silently returns a corrupted MP4 (which happens at roughly a 0.4% rate even on the Pro tier), the assembly stage detects the corruption on the first frame-extract pass and re-queues the clip, which means the failure never escalates to a human and the cost is one extra Kling call. Idempotency at every stage is the single architectural decision that lets the system run unattended overnight.
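The pattern generalizes to any stage: validate every output, re-queue the same job on a bad result, and drop the asset (never the batch) after a bounded number of attempts. A minimal sketch, where the renderer and validator are stand-ins for the real provider call and the frame-extract check:

```python
# Minimal sketch of the idempotent-stage pattern: a corrupted output
# re-queues the same job, and a job that keeps failing drops only its own
# asset. render() and is_valid() are stand-ins for the provider call and
# the first-frame corruption check described in the text.
from collections import deque

def run_stage(jobs, render, is_valid, max_attempts=3):
    """Process jobs, re-queueing any whose output fails validation."""
    queue = deque((job, 1) for job in jobs)
    done, dropped = [], []
    while queue:
        job, attempt = queue.popleft()
        output = render(job)                  # e.g. one image-to-video call
        if is_valid(output):
            done.append(output)
        elif attempt < max_attempts:
            queue.append((job, attempt + 1))  # cost: one extra provider call
        else:
            dropped.append(job)               # drop the asset, not the batch
    return done, dropped
```

Because re-running a job is safe by construction, a transient corruption costs one extra call and never escalates to a human.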
The second detail is that we treat the LLM-judge stage as the single most important component in the pipeline. The verification rubric is six questions long because we tried it with twelve, and the judge's accuracy collapsed; we tried it with three, and the judge passed too much slop. Six is the experimentally validated point at which the per-question accuracy stays above 92% and the aggregate decision matches a human reviewer about 88% of the time. The remaining 12% disagreement is what the human-reviewer queue exists to handle, and the queue's daily volume is the metric we use to know whether the rubric needs re-tuning.
The third detail is the slop rejection rate. The LLM judge kills somewhere between 14 and 22 percent of every batch outright. That number is high, deliberately. Early versions of the pipeline shipped most of what they generated, and the per-account performance suffered visibly within three weeks — saves dropped, watch-through dropped, the algorithm moved the accounts down. The economics of slop are not "free content is good"; the economics of slop are that one bad short reduces the next ten shorts' organic distribution on the same account. So we kill aggressively at the verification gate. The rejected assets are not regenerated. They are dropped, and the topic gets a cooldown stamp so the topic generator won't re-propose it for 14 days.
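The cooldown stamp itself is a small piece of state. A sketch of the idea, with in-memory storage standing in for whatever the production cache persists to:

```python
# Sketch of the 14-day cooldown stamp the topic generator consults before
# re-proposing a killed topic. Storage is an in-memory dict here; names
# and the example topic are illustrative.
from datetime import datetime, timedelta

COOLDOWN = timedelta(days=14)

class TopicCooldown:
    def __init__(self):
        self._stamps = {}  # topic -> time it was killed or shipped

    def stamp(self, topic: str, when: datetime):
        self._stamps[topic] = when

    def is_blocked(self, topic: str, now: datetime) -> bool:
        stamped = self._stamps.get(topic)
        return stamped is not None and now - stamped < COOLDOWN

cd = TopicCooldown()
cd.stamp("before-after morning routine", datetime(2026, 3, 1))
print(cd.is_blocked("before-after morning routine", datetime(2026, 3, 10)))  # -> True
print(cd.is_blocked("before-after morning routine", datetime(2026, 3, 16)))  # -> False
```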
The fourth detail is the account warmup. New accounts spend 14 days posting nothing AI-generated. They post 2–3 manually-curated reposts a day, follow accounts in their target niche, react organically. After 14 days the account is "warm" and starts receiving pipeline content. We tried compressing the warmup to 7 days for one cohort of 12 accounts; 9 of the 12 hit a soft suppression signal within 30 days. The 14-day warmup costs about $40 per account in operator time and prevents an outcome that costs about $80 per account in lifetime ad-equivalent value lost.
The fifth detail is residential proxies. Every account has a dedicated residential IP from the geography of its bio. We rotate the IP only on account migration — never mid-session. The residential proxy stack is the single most boring infrastructure component and the one most likely to get a pipeline killed if it's run on cheap datacenter IPs.
What we learned generating fitness-transformation avatars
The first vertical the pipeline shipped was AI fitness-transformation content for a single client. The brief was straightforward: produce before/after avatars showing weight-loss outcomes in a way that read as visually credible without actually misrepresenting any specific outcome. The legal posture was that every avatar was clearly labeled as AI-generated and the client's marketing copy was written around aggregate outcomes, not individual transformations.
The visual problem was harder than it sounds. Early Nano Banana Pro generations produced "before" and "after" images that didn't read as the same person. The face structure drifted, the lighting differed, the wardrobe was incongruent. We solved it with a two-pass approach: generate the "before" first with full control over identity tokens, then use Nano Banana Pro's edit mode to produce the "after" by passing the "before" image as a reference and instructing the model to preserve facial identity while modifying body composition and pose. The two-pass approach took the identity-drift failure rate from about 35% on first-shot generations to about 4% on edit-mode generations.
The motion problem was easier. We pass the "before" still and the "after" still to Kling 3 as a two-frame sequence, ask it to produce a 5-second morph clip, and let Kling handle the in-between motion. Kling's interpolation is uncannily good at this specific transition. The first 1.2 seconds of the clip — the hook — is the "before" hold; the next 2.8 seconds is the morph; the final 1 second is the "after" hold. The whole thing reads as a visual transformation without ever claiming to depict a real person.
The volume out of this vertical is the largest in the whole pipeline. The fitness vertical accounts for roughly 38% of monthly shorts and a slightly higher share of monthly views, because the per-asset performance is exceptional in the relevant demographic.
What we learned generating Peptivo GLP-1 campaign assets
Peptivo is a different shape of problem. GLP-1 medications are a regulated category, and the platform-side moderation on weight-loss-medication content is aggressive and inconsistent across platforms. A short that lives happily on TikTok might be flagged on Instagram. A short that runs on YouTube Shorts might be silently deranked on TikTok.
The pipeline solution was to add a per-platform compliance pre-check between the assembly stage and the distribution stage. The compliance pre-check is another LLM-judge pass — separate from the quality judge — that scores each short against the moderation rubric of each target platform and either approves the short for distribution on that platform or rejects it. A short can be approved on YouTube Shorts and rejected on Instagram and the system will distribute accordingly.
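The fan-out logic is a per-(short, platform) verdict matrix. A sketch under assumed names, with a stub standing in for the second LLM-judge pass:

```python
# Sketch of the per-platform distribution fan-out described above: the
# compliance judge yields one verdict per (short, platform) pair, and the
# scheduler queues the short only on platforms that approved it.
# Platform list, short IDs, and the verdict stub are illustrative.

PLATFORMS = ("youtube_shorts", "tiktok", "instagram")

def distribution_targets(short_id, compliance_check):
    """Return the platforms this short may be queued on."""
    return [p for p in PLATFORMS if compliance_check(short_id, p)]

# Stub standing in for the per-platform LLM compliance pass.
verdicts = {("glp1_explainer_07", "youtube_shorts"): True,
            ("glp1_explainer_07", "tiktok"): True,
            ("glp1_explainer_07", "instagram"): False}

targets = distribution_targets("glp1_explainer_07",
                               lambda s, p: verdicts.get((s, p), False))
print(targets)  # -> ['youtube_shorts', 'tiktok']
```

An unknown (short, platform) pair defaults to rejected, which is the safe failure mode for a regulated category.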
The visual register for Peptivo content is intentionally calmer than the fitness-vertical work. The brief was educational explainer over visual storytelling, so the assets lean on Nano Banana Pro stills with overlay text rather than Kling motion. The verification rubric for this vertical was extended with a seventh question: does the caption avoid any specific clinical claim that we don't have substantiation for. The seventh question matters because the brand's legal counsel reviews the rubric, not individual shorts, and the rubric is what we operationally defend.
The Peptivo vertical accounts for roughly 11% of monthly shorts but a much higher share of paid-media usage downstream — the brand re-cuts the top-performing organic shorts as paid creative, which is a separate workflow we don't run but which loops performance data back to us.
> The most expensive failure mode is a banned account, not a bad short. We learned to build the account-health monitor before we built the next visual generation upgrade.
What we learned generating Trimrx imagery
Trimrx is the smallest vertical by volume and the one most useful for understanding what the pipeline cannot do. Trimrx is a wellness brand whose visual language is editorial — soft lighting, magazine-style composition, considered product photography. Most of the pipeline's optimizations — slop rejection at 18%, cost-per-short at $0.30, throughput at 31K shorts a month — are wrong for Trimrx, because Trimrx's bar for a usable asset is much higher than the algorithmic-content bar that drives the fitness vertical.
We ran Trimrx through the pipeline for two months and the slop rejection rate climbed to 41%. The remaining 59% was usable but not on-brand. The honest read on that experiment is that the pipeline is not the right tool for an editorial brand at low volumes — it is the right tool for a category that rewards velocity over composition. We moved Trimrx onto a separate, slower workflow with a human art director in the loop on every asset; the share of usable assets rose from 59% to 91%, at the cost of throughput dropping from 600 shorts a month to 80.
The lesson generalizes. The pipeline produces algorithmic-fit content, not brand-prestige content. The distinction matters and it does not show up in any of the metrics that algorithmic platforms surface.
What we learned wiring Claude Code into the topic-generation stage
The topic generator was the first stage to run on Claude Code rather than a one-off script, and the upgrade changed the shape of what topic generation could do. The earlier version was a deterministic prompt that ran against a fixed corpus and produced a fixed-length list. The Claude Code version is a small agent loop that can read the previous week's performance data, query the per-vertical subreddit corpus on demand, propose a draft topic, run a self-check against the 14-day overlap cache, revise the topic if the overlap check fails, and emit a final list with confidence scores per topic.
The agent-loop version produces topics whose first-week performance is roughly 22% higher than the deterministic-prompt version on every metric we measure. The cost difference is small — about $0.04 per topic generated, against $0.006 for the deterministic version — and the cost is amortized across thousands of downstream shorts that all benefit from a better hook.
We wrote the topic-generation Claude Code agent the same way we wrote the solo-founder agent stack — small role, narrow scope, idempotent, no write access outside its own bucket.
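The loop's control flow, stripped of the model calls, looks roughly like this. `propose()` and `overlaps_recent()` stand in for the Claude call and the 14-day overlap cache, and the confidence scoring is a naive placeholder, not the production scheme:

```python
# Control-flow sketch of the propose -> self-check -> revise loop described
# above. propose() and overlaps_recent() stand in for the Claude call and
# the 14-day overlap cache; confidence scoring is a naive placeholder.

def generate_topics(propose, overlaps_recent, n_slots, max_revisions=2):
    """Emit topics that survive the overlap self-check; each forced
    revision lowers the topic's confidence score."""
    out = []
    for slot in range(n_slots):
        confidence = 1.0
        for attempt in range(max_revisions + 1):
            topic = propose(slot, attempt)
            if not overlaps_recent(topic):
                out.append({"topic": topic, "confidence": confidence})
                break
            confidence -= 0.25   # overlap found: revise and try again
    return out

recent = {"5 desk stretches"}          # shipped in the last 14 days
drafts = {(0, 0): "5 desk stretches",  # first draft collides with the cache
          (0, 1): "3 wall stretches",
          (1, 0): "protein myths"}
print(generate_topics(lambda s, a: drafts[(s, a)], recent.__contains__, 2))
# -> [{'topic': '3 wall stretches', 'confidence': 0.75},
#     {'topic': 'protein myths', 'confidence': 1.0}]
```

A slot whose drafts keep colliding simply emits nothing, which keeps the daily list honest rather than padded.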
The phased rollout, in months
The pipeline did not arrive fully formed. The first version shipped 80 shorts a month and broke constantly. The current version ships 31,000. The rough timeline:
| Phase | Months | Throughput | Slop rate | Active accounts |
|---|---|---|---|---|
| 1: single-account proof | M1–M2 | ~80/mo | ~62% | 1 |
| 2: vertical expansion | M3–M5 | ~1.2K/mo | ~38% | 6 |
| 3: judge introduced | M6–M7 | ~3.4K/mo | ~24% | 12 |
| 4: account warmup formalized | M8–M9 | ~6.8K/mo | ~21% | 22 |
| 5: Kling 3 + Nano Banana Pro | M10–M11 | ~14K/mo | ~18% | 38 |
| 6: distribution-layer rebuild | M12–M14 | ~31K/mo | ~16% | 60 |
Two of these phases account for almost all of the volume gain: phase 3 (introducing the LLM judge, which made it economic to ship at higher throughput because slop stopped killing accounts) and phase 6 (rebuilding the distribution layer, which removed the operator-time bottleneck on how many accounts could be run in parallel).
The cost model that justifies the build
The unit economics matter more than the headline view count, because the pipeline only pays back if the cost-per-short and the value-per-view stay in the right ratio.
Cost-per-finished-short averages $0.30 across the full mix, with the cheapest verticals (text-overlay-style explainers) coming in around $0.18 and the most expensive (high-fidelity Kling Pro motion) coming in around $0.42. The cost line is dominated by Kling 3 video generation (about 58% of the per-short cost), with LLM verification (about 22%), still generation (about 12%), and distribution-layer infrastructure (about 8%) splitting the remainder.
Value-per-view varies wildly by vertical. The fitness vertical clears around $1.40 CPM on the brand's downstream paid spend; Peptivo clears closer to $3.20 CPM because the regulated-category buyer pool is smaller and more valuable; Trimrx clears almost nothing in the algorithmic-distribution channel because the brand's actual buyers don't shop on TikTok.
In aggregate, the pipeline returns roughly 5.4x revenue per dollar of cost across the mix. Most of that return is concentrated in the two highest-paying verticals, and the lower-CPM verticals function as account-health insurance — they keep the accounts active, varied, and looking organic, which protects the high-CPM verticals from getting flagged.
What we'd do differently if we built this again
Three things, in priority order.
The first is to build the account-health monitor before the second vertical, not after the fifth. The cost of building it early is two engineer-weeks. The cost of not building it early was the 23-account banning weekend in November, which set the program back roughly six weeks.
The second is to invest in the LLM-judge rubric before any throughput optimization. We spent two months in phase 2 trying to push throughput from 1,200 to 3,000 shorts a month before introducing the judge in phase 3. The judge would have made the throughput push easier, cheaper, and less account-damaging if we'd built it first.
The third is to write the cost meter into every stage from day one. The current pipeline emits a per-asset cost record at every stage, which is how we know what the cost-per-short actually is. The earlier versions estimated cost from invoice-level rollups, which made it impossible to know which vertical was expensive and which was cheap. The per-asset meter took an afternoon to build retroactively and we should have built it on day one.
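The meter really is an afternoon of work. A minimal sketch: every stage emits an (asset_id, stage, cost) record, and cost-per-short is a straight rollup. Stage names and dollar figures below are illustrative, not our actual rates:

```python
# Sketch of the per-asset cost meter: each pipeline stage emits an
# (asset_id, stage, usd) record, and cost-per-short falls out of a rollup.
# Stage names and dollar figures are illustrative.
from collections import defaultdict

class CostMeter:
    def __init__(self):
        self._records = []

    def record(self, asset_id: str, stage: str, usd: float):
        self._records.append((asset_id, stage, usd))

    def cost_per_asset(self) -> dict[str, float]:
        totals = defaultdict(float)
        for asset_id, _stage, usd in self._records:
            totals[asset_id] += usd
        # round away float noise for reporting
        return {k: round(v, 4) for k, v in totals.items()}

meter = CostMeter()
meter.record("short_001", "still_gen", 0.04)
meter.record("short_001", "image_to_video", 0.17)
meter.record("short_001", "llm_judge", 0.07)
print(meter.cost_per_asset())  # -> {'short_001': 0.28}
```

Grouping the same records by stage instead of asset gives the vertical-level breakdown that invoice rollups could never provide.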
We've now generalized enough of this stack that we run it for three other operators in adjacent categories on a revenue-share basis, which is closer to a productized service than the traditional agency model the same team used to operate. The shape of the work is different. The infrastructure is the product.