OpenClaw Factory: 33 autonomous Claude agents, 6 months of production
A field report from running OpenClaw Factory, a 33-agent Claude architecture, through six months of production: the architecture, the metrics, and the lessons.

Six months ago we put a multi-agent Claude system into production and called it OpenClaw Factory. It runs the back office for four of our businesses — content, finance, code review, support triage, internal tooling — using thirty-three long-running agents, each with a narrow charter and a persistent memory store. The single biggest lesson of those six months is that charter narrowness beats agent intelligence: thirty-three sharply scoped agents outperform one brilliant one, every time, by a margin we did not expect. This is the field report — what the architecture looks like, what we measured, what broke, and what we'd do differently if we were starting today.
We are not selling a product here. OpenClaw Factory is internal. The reason we're publishing this is that almost every public write-up on multi-agent systems is either a benchmark on a synthetic task or a demo that runs once. We've been running this thing in our actual back office for half a year, with the bills and bug reports to prove it.
Why 33 agents, not one
The case for a single agent is seductive. One context, one memory, one prompt to maintain. The case dies the moment you try to ship anything that touches more than two domains. A single agent that has to know about your Shopify accounting and your Postgres schema and your customer-support tone of voice runs out of context steering by turn twelve, and starts hallucinating the parts it can't keep in working memory. We tried this. It was bad.
What we replaced it with was a directed graph of agents, each with one job. The orchestrator (we call it the foreman) is a Sonnet 4.6 agent whose only job is to read inbound work and decide which specialist gets it. Specialists are mostly Sonnet, with Opus 4.7 reserved for planning and code review where the marginal token is worth it. Haiku 4.5 is the router for cheap classification.
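That tiering policy reads naturally as a routing function. The sketch below is the policy as described, not code from the factory; mapping `verify` to Opus is our reading of "code review":

```typescript
// Sketch of the model-tiering policy: Haiku for cheap classification,
// Opus where the marginal token is worth it, Sonnet for everything else.
// The intent labels are illustrative; routing 'verify' to Opus is an
// assumption based on "Opus reserved for planning and code review".
type Intent = 'execute' | 'verify' | 'plan' | 'classify'
type ModelTier = 'opus-4-7' | 'sonnet-4-6' | 'haiku-4-5'

function modelFor(intent: Intent): ModelTier {
  switch (intent) {
    case 'classify':
      return 'haiku-4-5' // cheap routing and triage
    case 'plan':
    case 'verify':
      return 'opus-4-7' // planning and adversarial review
    default:
      return 'sonnet-4-6' // the workhorse tier
  }
}
```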
The eleven-hours figure is the one we get asked about. It isn't throughput on raw work output; it's founder hours: how much of our calendar the factory bought back versus a baseline week with no agents. Two of those hours were taken back up by maintaining the factory itself. Net win, eleven hours weekly per business — which is why we run it across four businesses and still come out ahead.
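The bookkeeping behind that claim is one line of arithmetic. A sketch, with the gross thirteen-hour figure implied by the numbers above (the function name is ours):

```typescript
// Net founder-hours bought back per week across all businesses.
// The 13h gross and 2h maintenance figures are the article's reported
// estimates, not telemetry; the function name is ours.
function netWeeklyHours(
  grossPerBusiness: number,
  maintenancePerBusiness: number,
  businesses: number,
): number {
  return (grossPerBusiness - maintenancePerBusiness) * businesses
}

// (13 - 2) * 4 = 44, matching the ~44 hours in the metrics table below.
```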
The architecture, in one type and thirty lines of TypeScript
Every agent in the factory has the same shape: a charter (system prompt), a tool surface, a memory store, and a delegation function. The memory store is a single Postgres table keyed by (agent_id, topic_id); the tool surface is a small registry; delegation is itself a tool call.
```typescript
// src/factory/agent.ts — the only agent class.
type AgentSpec = {
  id: string // 'foreman', 'finance-clerk', etc.
  model: 'opus-4-7' | 'sonnet-4-6' | 'haiku-4-5'
  charter: string // the system prompt; immutable per deploy
  tools: ToolName[] // what this agent is allowed to call
  delegatesTo: string[] // other agent ids this one may delegate to
  memoryNamespace: string // Postgres key namespace
}

export async function runAgent(spec: AgentSpec, input: AgentInput) {
  const memory = await loadMemory(spec.memoryNamespace, input.topicId)
  const messages = [
    { role: 'system', content: spec.charter + '\n\n' + memory.systemTail() },
    ...memory.recentTurns(20),
    { role: 'user', content: input.prompt },
  ]
  const tools = registry.materialize(spec.tools, {
    delegate: (role, task) =>
      orchestrator.enqueue({ from: spec.id, to: role, task, parent: input.id }),
  })
  const result = await claude.messages.create({
    model: MODEL_MAP[spec.model],
    system: messages[0].content,
    messages: messages.slice(1),
    tools,
    max_tokens: spec.model === 'opus-4-7' ? 16_000 : 8_000,
  })
  await persistTurn(spec, input, result)
  return result
}
```

The orchestrator is its own small process — not a Claude agent, just a Postgres-backed queue with priority and concurrency limits. Agents enqueue delegations via the delegate tool; the orchestrator wakes the right specialist with the right input. There is no LangGraph here, no AutoGen, no MCP server orchestration layer. We tried two of those frameworks early; both added more failure modes than they removed. The factory is roughly twelve hundred lines of TypeScript, and we like that we can read all of it on a flight.
What lives in memory
Memory was the second-hardest design call after charter scoping. Every agent has a private memory namespace; nothing is global. The shape we landed on:
- Episodic — recent turns of conversation, capped at twenty, used for continuity within a topic.
- Reference — facts the agent learned about the world: account IDs, schema names, who-owns-what. Updated rarely, read often.
- Feedback — instructions the agent received about how to do its job. Read every turn, written when a human corrects the agent.
The split matters because it lets us truncate aggressively. Episodic memory is allowed to fall off the back of the window; reference and feedback are not. When a planner agent's plan starts drifting at turn 40, ninety percent of the time the cause is reference memory not making it into the system prompt because we got greedy with episodic.
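The policy is mechanical enough to sketch. Assume a three-tier store; the names and token accounting here are illustrative, not the factory's actual helpers:

```typescript
// Sketch of the truncation policy: episodic memory is the only tier
// allowed to fall off the back of the window. All names are illustrative.
type Turn = { role: 'user' | 'assistant'; content: string; tokens: number }

type MemoryTiers = {
  reference: string[] // facts: account IDs, schema names; never truncated
  feedback: string[] // human corrections; never truncated
  episodic: Turn[] // recent turns, newest first, capped and droppable
}

function assembleContext(
  m: MemoryTiers,
  budgetTokens: number,
): { systemTail: string; turns: Turn[] } {
  // Reference and feedback always make it into the system prompt.
  const systemTail = [...m.reference, ...m.feedback].join('\n')
  // Episodic turns fill whatever budget remains, newest first.
  const turns: Turn[] = []
  let spent = 0
  for (const turn of m.episodic.slice(0, 20)) {
    if (spent + turn.tokens > budgetTokens) break
    turns.push(turn)
    spent += turn.tokens
  }
  // Reverse back to oldest-first for the messages array.
  return { systemTail, turns: turns.reverse() }
}
```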
What broke in production
Two failure modes accounted for roughly seventy percent of incidents in the first three months. The third month is when we built the checkpoints that mostly fixed them.
Failure 1: social maintenance burn
The first thing we noticed was that agents were polite to each other. The foreman would delegate a task with "could you take a look at this when you get a chance?" and the specialist would respond with "absolutely, getting on it now — let me know if you want me to focus on anything specific." Multiply this across a thousand-call day and you have an interesting Claude bill made of nothing but please-and-thank-you tokens.
The fix was to strip the inter-agent message format down to a JSON envelope. Agents talk to humans in prose. Agents talk to other agents in this:
```typescript
type DelegationEnvelope = {
  from: AgentId
  to: AgentId
  intent: 'execute' | 'verify' | 'plan' | 'classify'
  payload: Record<string, unknown>
  context_refs: string[] // pointers into shared memory, not inlined text
  return_by: ISODate | null
}
```

That alone cut inter-agent token spend by roughly forty percent and made transcripts auditable. The lesson generalized: agents do not need to talk to each other the way humans do, and they get worse at the work when you let them try.
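For concreteness, here is what a single delegation might look like under this scheme. The payload fields and memory-key format are invented for illustration:

```typescript
// Invented example values: the payload fields and memory-key strings are
// illustrative, not copied from a real transcript.
const envelope = {
  from: 'foreman',
  to: 'finance-clerk',
  intent: 'execute' as const,
  payload: { task: 'reconcile-march-payouts' },
  // Pointers into shared memory instead of inlined prose; the specialist
  // dereferences what it needs, so context is never restated in tokens.
  context_refs: ['memory:finance-clerk/accounts', 'memory:foreman/payout-policy'],
  return_by: null,
}
```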
Failure 2: planner drift
Opus 4.7 planners running multi-step jobs would, around turn 35-40, produce plans that no longer reflected the original task. Not hallucination exactly — more like a slow rewrite of the goal under the pressure of accumulated context. We patched it with a hard checkpoint: every ten turns, the planner is forced to re-read the original task and emit a one-paragraph "what I'm doing and why" summary. If the summary diverges from the task, the orchestrator rejects the plan and restarts the agent with a fresh window seeded only from reference memory.
```typescript
// src/factory/checkpoints.ts
export async function planCheckpoint(agentId: string, turn: number) {
  if (turn % 10 !== 0) return
  const original = await getOriginalTask(agentId)
  const summary = await runAgent(planner, {
    id: `${agentId}-checkpoint-${turn}`,
    prompt: `Restate the original task in one paragraph and confirm your current direction.`,
    topicId: agentId,
  })
  const drift = await measureDrift(original, summary.content[0].text)
  if (drift > 0.35) {
    await orchestrator.restart(agentId, { seed: 'reference-only' })
  }
}
```

The drift score is a small embedding-cosine job; the threshold is hand-tuned. It's not elegant. It works.
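An embedding-cosine drift score needs only a few lines. This is a sketch, assuming some `embed` function that maps text to a vector; nothing about our actual implementation is shown above beyond the 0.35 threshold:

```typescript
// Sketch of an embedding-cosine drift score. `embed` is a stand-in for
// whatever embedding endpoint you use; the rest is plain vector math.
declare function embed(text: string): Promise<number[]>

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Drift is distance, not similarity: identical texts score 0 and
// unrelated texts approach 1, so a hand-tuned 0.35 cutoff makes sense.
async function measureDrift(original: string, summary: string): Promise<number> {
  const [a, b] = await Promise.all([embed(original), embed(summary)])
  return 1 - cosineSimilarity(a, b)
}
```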
What we measured
The metrics we pay attention to changed over the six months. We started with token spend and call latency, which are the obvious things, and we still watch them. But the metrics that actually predicted whether the factory was worth running were softer.
| Metric | Why it matters | Current |
|---|---|---|
| Founder hours bought back / week | The reason the factory exists | ~44 hours across 4 businesses |
| Reversal rate | How often a human had to undo an agent's action | 4.1% |
| Re-prompt rate | How often a human had to clarify before the agent finished | 11% |
| Memory bloat | Tokens in reference memory per agent | median 2.3K, p95 6.1K |
| Mean charter age | How long since the system prompt last changed | 71 days |
The reversal rate is the one we'd watch if we only had one. Below five percent it feels safe; above ten percent we'd shut things off and rewrite charters.
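Those thresholds are concrete enough to encode as a guardrail; the function and its verdict labels are our illustration, the numbers are from the text:

```typescript
// Sketch of the reversal-rate policy: below 5% feels safe, above 10%
// means shut things off and rewrite charters, in between means watch.
// Verdict labels are ours; thresholds are the article's.
type ReversalVerdict = 'safe' | 'watch' | 'shut-off-and-rewrite'

function reversalVerdict(reversed: number, total: number): ReversalVerdict {
  const rate = reversed / total
  if (rate < 0.05) return 'safe'
  if (rate <= 0.1) return 'watch'
  return 'shut-off-and-rewrite'
}

// The table's current 4.1% reversal rate lands in the safe band.
```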
> The simplest explanation is usually right; if I'm reaching for a third-order tool call to explain a bookkeeping discrepancy, I'm probably wrong about the ledger.

— our finance clerk agent, replying to itself in a turn-12 transcript
We did not write that line. The agent did. We mention it because we caught ourselves quoting it in a planning meeting two weeks later. The factory is starting to surface its own intuitions, and they are sometimes better than ours. This is either the most interesting or most concerning thing we've observed in six months, depending on the day.
What we'd do differently if we started today
Three calls we'd reverse.
We'd start with three agents, not thirteen. We launched with thirteen and immediately tripped over inter-agent dependencies we hadn't designed for. The path that works is: ship one agent, watch it fail on real work for a week, split it into two agents along the line of failure, repeat. The graph that emerged was different from the one we'd designed on paper, in ways that paid off.
We'd build the observability layer first. For two months we were diagnosing factory issues by reading raw Anthropic console logs. The week we shipped a small internal dashboard that grouped traces by topic and agent, our incident response time dropped by an order of magnitude. If you build this thing without dashboards, you're flying blind through your own factory.
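The core of that dashboard is one grouping pass: bucket raw call traces by topic and agent so an incident reads as a single thread. A sketch, with an invented `Trace` shape:

```typescript
// Sketch of the grouping the dashboard does. The Trace shape is invented;
// in practice these rows would come out of the orchestrator's queue table.
type Trace = { agentId: string; topicId: string; turn: number; summary: string }

function groupTraces(traces: Trace[]): Map<string, Trace[]> {
  const groups = new Map<string, Trace[]>()
  for (const t of traces) {
    const key = `${t.topicId}/${t.agentId}`
    const bucket = groups.get(key) ?? []
    bucket.push(t)
    groups.set(key, bucket)
  }
  // Within a group, order by turn so the transcript reads top to bottom.
  for (const bucket of groups.values()) bucket.sort((a, b) => a.turn - b.turn)
  return groups
}
```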
We'd be slower to add Opus. Opus 4.7 is a beautiful model. It is also four times the cost of Sonnet 4.6, and for ninety percent of the work in the factory the marginal Opus token does not move the outcome. Reserve Opus for planning and adversarial review; don't put it on the inner loop. We learned this the expensive way.
The path of least regret, if we were starting today, would be: one Sonnet agent with a five-paragraph charter, a memory table, and a single delegation tool. Then split when it bleeds. Then dashboard. Then maybe Opus for the planner. The thirty-three-agent factory is what that path looks like after six months — not a thing you should try to build on day one.
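Concretely, in the shape of the AgentSpec type from the architecture section, day one is one spec with an empty delegation list. Charter text, tool names, and ids here are placeholders, not our production prompts:

```typescript
// The day-one factory as described: one Sonnet agent, one charter, one
// delegation tool, nothing to delegate to yet. All values are placeholders.
const dayOne = {
  id: 'generalist',
  model: 'sonnet-4-6' as const,
  charter: [
    'You run the back office for one small business.',
    'Prefer asking over guessing. Log every action you take.',
    // ...three more paragraphs of scope, tone, and escalation rules
  ].join('\n\n'),
  tools: ['read_memory', 'write_memory', 'delegate'],
  delegatesTo: [] as string[], // populated only after the first split-on-failure
  memoryNamespace: 'generalist',
}
```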
We have a second internal system that takes the opposite design path: thousands of cheap agents that exist only to debate and predict, never to execute. We'll publish a piece comparing the two later this spring; the architectural and philosophical differences turned out to matter more than we expected.