OpenClaw Factory: 33 autonomous Claude agents, 6 months of production
A field report from running OpenClaw Factory, a 33-agent Claude architecture, through six months of production: the architecture, the metrics, and the lessons.

Six months ago we put a multi-agent Claude system into production and called it OpenClaw Factory. It runs the back office for four of our businesses — content, finance, code review, support triage, internal tooling — using thirty-three long-running agents, each with a narrow charter and a persistent memory store. The single biggest lesson of those six months is that charter narrowness beats agent intelligence: thirty-three sharply scoped agents outperform one brilliant one, every time, by a margin we did not expect. This is the field report — what the architecture looks like, what we measured, what broke, and what we'd do differently if we were starting today.
We are not selling a product here. OpenClaw Factory is internal. The reason we're publishing this is that almost every public write-up on multi-agent systems is either a benchmark on a synthetic task or a demo that runs once. We've been running this thing in our actual back office for half a year, with the bills and bug reports to prove it.
Why 33 agents, not one
The case for a single agent is seductive. One context, one memory, one prompt to maintain. The case dies the moment you try to ship anything that touches more than two domains. A single agent that has to know about your Shopify accounting and your Postgres schema and your customer-support tone of voice runs out of context steering by turn twelve, and starts hallucinating the parts it can't keep in working memory. We tried this. It was bad.
What we replaced it with was a directed graph of agents, each with one job. The orchestrator (we call it the foreman) is a Sonnet 4.6 agent whose only job is to read inbound work and decide which specialist gets it. Specialists are mostly Sonnet, with Opus 4.7 reserved for planning and code review where the marginal token is worth it. Haiku 4.5 is the router for cheap classification.
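That tiering policy reads naturally as a routing function. The sketch below is the policy as described, not code from the factory; mapping `verify` to Opus is our reading of "code review":

```typescript
// Sketch of the model-tiering policy: Haiku for cheap classification,
// Opus where the marginal token is worth it, Sonnet for everything else.
// The intent labels are illustrative; routing 'verify' to Opus is an
// assumption based on "Opus reserved for planning and code review".
type Intent = 'execute' | 'verify' | 'plan' | 'classify'
type ModelTier = 'opus-4-7' | 'sonnet-4-6' | 'haiku-4-5'

function modelFor(intent: Intent): ModelTier {
  switch (intent) {
    case 'classify':
      return 'haiku-4-5' // cheap routing and triage
    case 'plan':
    case 'verify':
      return 'opus-4-7' // planning and adversarial review
    default:
      return 'sonnet-4-6' // the workhorse tier
  }
}
```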
The eleven-hours figure is the one we get asked about. It isn't throughput on raw work output; it's founder hours: how much of our calendar the factory bought back versus a baseline week with no agents. Two of those hours were taken back up by maintaining the factory itself. Net win, eleven hours weekly per business — which is why we run it across four businesses and still come out ahead.
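The bookkeeping behind that claim is one line of arithmetic. A sketch, with the gross thirteen-hour figure implied by the numbers above (the function name is ours):

```typescript
// Net founder-hours bought back per week across all businesses.
// The 13h gross and 2h maintenance figures are the article's reported
// estimates, not telemetry; the function name is ours.
function netWeeklyHours(
  grossPerBusiness: number,
  maintenancePerBusiness: number,
  businesses: number,
): number {
  return (grossPerBusiness - maintenancePerBusiness) * businesses
}

// (13 - 2) * 4 = 44, matching the ~44 hours in the metrics table below.
```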
The architecture, in one type and thirty lines of TypeScript
Every agent in the factory has the same shape: a charter (system prompt), a tool surface, a memory store, and a delegation function. The memory store is a single Postgres table keyed by (agent_id, topic_id); the tool surface is a small registry; delegation is itself a tool call.
```typescript
// src/factory/agent.ts — the only agent class.
type AgentSpec = {
  id: string // 'foreman', 'finance-clerk', etc.
  model: 'opus-4-7' | 'sonnet-4-6' | 'haiku-4-5'
  charter: string // the system prompt; immutable per deploy
  tools: ToolName[] // what this agent is allowed to call
  delegatesTo: string[] // other agent ids this one may delegate to
  memoryNamespace: string // Postgres key namespace
}

export async function runAgent(spec: AgentSpec, input: AgentInput) {
  const memory = await loadMemory(spec.memoryNamespace, input.topicId)
  const messages = [
    { role: 'system', content: spec.charter + '\n\n' + memory.systemTail() },
    ...memory.recentTurns(20),
    { role: 'user', content: input.prompt },
  ]
  const tools = registry.materialize(spec.tools, {
    delegate: (role, task) =>
      orchestrator.enqueue({ from: spec.id, to: role, task, parent: input.id }),
  })
  const result = await claude.messages.create({
    model: MODEL_MAP[spec.model],
    system: messages[0].content,
    messages: messages.slice(1),
    tools,
    max_tokens: spec.model === 'opus-4-7' ? 16_000 : 8_000,
  })
  await persistTurn(spec, input, result)
  return result
}
```

The orchestrator is its own small process — not a Claude agent, just a Postgres-backed queue with priority and concurrency limits. Agents enqueue delegations via the delegate tool; the orchestrator wakes the right specialist with the right input. There is no LangGraph here, no AutoGen, no MCP server orchestration layer. We tried two of those frameworks early; both added more failure modes than they removed. The factory is roughly twelve hundred lines of TypeScript, and we like that we can read all of it on a flight.
What lives in memory
Memory was the second-hardest design call after charter scoping. Every agent has a private memory namespace; nothing is global. The shape we landed on:
- Episodic — recent turns of conversation, capped at twenty, used for continuity within a topic.
- Reference — facts the agent learned about the world: account IDs, schema names, who-owns-what. Updated rarely, read often.
- Feedback — instructions the agent received about how to do its job. Read every turn, written when a human corrects the agent.
The split matters because it lets us truncate aggressively. Episodic memory is allowed to fall off the back of the window; reference and feedback are not. When a planner agent's plan starts drifting at turn 40, ninety percent of the time the cause is reference memory not making it into the system prompt because we got greedy with episodic.
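The policy is mechanical enough to sketch. Assume a three-tier store; the names and token accounting here are illustrative, not the factory's actual helpers:

```typescript
// Sketch of the truncation policy: episodic memory is the only tier
// allowed to fall off the back of the window. All names are illustrative.
type Turn = { role: 'user' | 'assistant'; content: string; tokens: number }

type MemoryTiers = {
  reference: string[] // facts: account IDs, schema names; never truncated
  feedback: string[] // human corrections; never truncated
  episodic: Turn[] // recent turns, newest first, capped and droppable
}

function assembleContext(
  m: MemoryTiers,
  budgetTokens: number,
): { systemTail: string; turns: Turn[] } {
  // Reference and feedback always make it into the system prompt.
  const systemTail = [...m.reference, ...m.feedback].join('\n')
  // Episodic turns fill whatever budget remains, newest first.
  const turns: Turn[] = []
  let spent = 0
  for (const turn of m.episodic.slice(0, 20)) {
    if (spent + turn.tokens > budgetTokens) break
    turns.push(turn)
    spent += turn.tokens
  }
  // Reverse back to oldest-first for the messages array.
  return { systemTail, turns: turns.reverse() }
}
```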
What broke in production
Two failure modes accounted for roughly seventy percent of incidents in the first three months. The third month is when we built the checkpoints that mostly fixed them.
Failure 1: social maintenance burn
The first thing we noticed was that agents were polite to each other. The foreman would delegate a task with "could you take a look at this when you get a chance?" and the specialist would respond with "absolutely, getting on it now — let me know if you want me to focus on anything specific." Multiply this across a thousand-call day and you have an interesting Claude bill made of nothing but please-and-thank-you tokens.
The fix was to strip the inter-agent message format down to a JSON envelope. Agents talk to humans in prose. Agents talk to other agents in this:
```typescript
type DelegationEnvelope = {
  from: AgentId
  to: AgentId
  intent: 'execute' | 'verify' | 'plan' | 'classify'
  payload: Record<string, unknown>
  context_refs: string[] // pointers into shared memory, not inlined text
  return_by: ISODate | null
}
```

That alone cut inter-agent token spend by roughly forty percent and made transcripts auditable. The lesson generalized: agents do not need to talk to each other the way humans do, and they get worse at the work when you let them try.
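For concreteness, here is what a single delegation might look like under this scheme. The payload fields and memory-key format are invented for illustration:

```typescript
// Invented example values: the payload fields and memory-key strings are
// illustrative, not copied from a real transcript.
const envelope = {
  from: 'foreman',
  to: 'finance-clerk',
  intent: 'execute' as const,
  payload: { task: 'reconcile-march-payouts' },
  // Pointers into shared memory instead of inlined prose; the specialist
  // dereferences what it needs, so context is never restated in tokens.
  context_refs: ['memory:finance-clerk/accounts', 'memory:foreman/payout-policy'],
  return_by: null,
}
```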
Failure 2: planner drift
Opus 4.7 planners running multi-step jobs would, around turn 35-40, produce plans that no longer reflected the original task. Not hallucination exactly — more like a slow rewrite of the goal under the pressure of accumulated context. We patched it with a hard checkpoint: every ten turns, the planner is forced to re-read the original task and emit a one-paragraph "what I'm doing and why" summary. If the summary diverges from the task, the orchestrator rejects the plan and restarts the agent with a fresh window seeded only from reference memory.
```typescript
// src/factory/checkpoints.ts
export async function planCheckpoint(agentId: string, turn: number) {
  if (turn % 10 !== 0) return
  const original = await getOriginalTask(agentId)
  const summary = await runAgent(planner, {
    id: `${agentId}-checkpoint-${turn}`,
    prompt: `Restate the original task in one paragraph and confirm your current direction.`,
    topicId: agentId,
  })
  const drift = await measureDrift(original, summary.content[0].text)
  if (drift > 0.35) {
    await orchestrator.restart(agentId, { seed: 'reference-only' })
  }
}
```

The drift score is a small embedding-cosine job; the threshold is hand-tuned. It's not elegant. It works.
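An embedding-cosine drift score needs only a few lines. This is a sketch, assuming some `embed` function that maps text to a vector; nothing about our actual implementation is shown above beyond the 0.35 threshold:

```typescript
// Sketch of an embedding-cosine drift score. `embed` is a stand-in for
// whatever embedding endpoint you use; the rest is plain vector math.
declare function embed(text: string): Promise<number[]>

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Drift is distance, not similarity: identical texts score 0 and
// unrelated texts approach 1, so a hand-tuned 0.35 cutoff makes sense.
async function measureDrift(original: string, summary: string): Promise<number> {
  const [a, b] = await Promise.all([embed(original), embed(summary)])
  return 1 - cosineSimilarity(a, b)
}
```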
What we measured
The metrics we pay attention to changed over the six months. We started with token spend and call latency, which are the obvious things, and we still watch them. But the metrics that actually predicted whether the factory was worth running were softer.
| Metric | Why it matters | Current |
|---|---|---|
| Founder hours bought back / week | The reason the factory exists | ~44 hours across 4 businesses |
| Reversal rate | How often a human had to undo an agent's action | 4.1% |
| Re-prompt rate | How often a human had to clarify before the agent finished | 11% |
| Memory bloat | Tokens in reference memory per agent | median 2.3K, p95 6.1K |
| Mean charter age | How long since the system prompt last changed | 71 days |
The reversal rate is the one we'd watch if we only had one. Below five percent it feels safe; above ten percent we'd shut things off and rewrite charters.
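Those thresholds are concrete enough to encode as a guardrail; the function and its verdict labels are our illustration, the numbers are from the text:

```typescript
// Sketch of the reversal-rate policy: below 5% feels safe, above 10%
// means shut things off and rewrite charters, in between means watch.
// Verdict labels are ours; thresholds are the article's.
type ReversalVerdict = 'safe' | 'watch' | 'shut-off-and-rewrite'

function reversalVerdict(reversed: number, total: number): ReversalVerdict {
  const rate = reversed / total
  if (rate < 0.05) return 'safe'
  if (rate <= 0.1) return 'watch'
  return 'shut-off-and-rewrite'
}

// The table's current 4.1% reversal rate lands in the safe band.
```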
> The simplest explanation is usually right; if I'm reaching for a third-order tool call to explain a bookkeeping discrepancy, I'm probably wrong about the ledger.

— our finance clerk agent, replying to itself in a turn-12 transcript
We did not write that line. The agent did. We mention it because we caught ourselves quoting it in a planning meeting two weeks later. The factory is starting to surface its own intuitions, and they are sometimes better than ours. This is either the most interesting or most concerning thing we've observed in six months, depending on the day.
What we'd do differently if we started today
Three calls we'd reverse.
We'd start with three agents, not thirteen. We launched with thirteen and immediately tripped over inter-agent dependencies we hadn't designed for. The path that works is: ship one agent, watch it fail on real work for a week, split it into two agents along the line of failure, repeat. The graph that emerged was different from the one we'd designed on paper, in ways that paid off.
We'd build the observability layer first. For two months we were diagnosing factory issues by reading raw Anthropic console logs. The week we shipped a small internal dashboard that grouped traces by topic and agent, our incident response time dropped by an order of magnitude. If you build this thing without dashboards, you're flying blind through your own factory.
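The core of that dashboard is one grouping pass: bucket raw call traces by topic and agent so an incident reads as a single thread. A sketch, with an invented `Trace` shape:

```typescript
// Sketch of the grouping the dashboard does. The Trace shape is invented;
// in practice these rows would come out of the orchestrator's queue table.
type Trace = { agentId: string; topicId: string; turn: number; summary: string }

function groupTraces(traces: Trace[]): Map<string, Trace[]> {
  const groups = new Map<string, Trace[]>()
  for (const t of traces) {
    const key = `${t.topicId}/${t.agentId}`
    const bucket = groups.get(key) ?? []
    bucket.push(t)
    groups.set(key, bucket)
  }
  // Within a group, order by turn so the transcript reads top to bottom.
  for (const bucket of groups.values()) bucket.sort((a, b) => a.turn - b.turn)
  return groups
}
```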
We'd be slower to add Opus. Opus 4.7 is a beautiful model. It is also four times the cost of Sonnet 4.6, and for ninety percent of the work in the factory the marginal Opus token does not move the outcome. Reserve Opus for planning and adversarial review; don't put it on the inner loop. We learned this the expensive way.
The path of least regret, if we were starting today, would be: one Sonnet agent with a five-paragraph charter, a memory table, and a single delegation tool. Then split when it bleeds. Then dashboard. Then maybe Opus for the planner. The thirty-three-agent factory is what that path looks like after six months — not a thing you should try to build on day one.
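Concretely, in the shape of the AgentSpec type from the architecture section, day one is one spec with an empty delegation list. Charter text, tool names, and ids here are placeholders, not our production prompts:

```typescript
// The day-one factory as described: one Sonnet agent, one charter, one
// delegation tool, nothing to delegate to yet. All values are placeholders.
const dayOne = {
  id: 'generalist',
  model: 'sonnet-4-6' as const,
  charter: [
    'You run the back office for one small business.',
    'Prefer asking over guessing. Log every action you take.',
    // ...three more paragraphs of scope, tone, and escalation rules
  ].join('\n\n'),
  tools: ['read_memory', 'write_memory', 'delegate'],
  delegatesTo: [] as string[], // populated only after the first split-on-failure
  memoryNamespace: 'generalist',
}
```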
We have a second internal system that takes the opposite design path: thousands of cheap agents that exist only to debate and predict, never to execute. We'll publish a piece comparing the two later this spring; the architectural and philosophical differences turned out to matter more than we expected.