MiroFish vs OpenClaw Factory: simulation vs execution
Two open-source multi-agent systems, two different bets. MiroFish simulates societies; OpenClaw executes work. Here's what each is for, and why it matters.

Two systems with the word "multi-agent" in their description. One you run to understand a system; the other you run to operate a system. In April 2026, after six months of watching both develop in public, the most useful frame I've found for anyone thinking about agent-based architecture is the split between them, and the way the industry's current vocabulary papers over it. The core insight from comparing MiroFish and OpenClaw Factory: multi-agent systems split along a simulation-versus-execution axis, and it is a fork, not a spectrum. Simulation frameworks optimize for interpretable emergent behavior across large populations; execution frameworks optimize for reliable task completion by a small number of specialized agents. The two require incompatible design choices at every significant architectural decision, and attempts to build one tool that does both produce systems that are mediocre at both. This is the comparison, the evidence, and the practical consequence for operators deciding what to build.
MiroFish, briefly
MiroFish is an open-source multi-agent simulation framework released in late 2025 by Guo Hangjiang and collaborators, published at github.com/666ghj/MiroFish under an AGPL-3.0 license. The project builds on OASIS, the social-simulation framework from CAMEL-AI, and extends it for larger-scale population modeling. Chen Tianqiao reportedly put roughly 30 million yuan (approximately $4M USD) into the effort to fund scaling experiments.
The system is designed to run simulations of LLM-driven agent populations in social environments — cities, markets, information networks. A typical MiroFish experiment might model 10,000 agents with individual personality configurations, economic endowments, and social graphs, and watch what happens when a perturbation (a rumor, a price change, a policy shift) propagates through the population.
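MiroFish's actual APIs aren't shown here, but the shape of such an experiment can be sketched with a toy, LLM-free stand-in: a seeded population on a random social graph, one rumor injected as the perturbation, and adoption counted per step. Every name and parameter below is illustrative, not MiroFish's.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Toy stand-in for a simulated person (hypothetical fields)."""
    agent_id: int
    susceptibility: float               # personality parameter in [0, 1]
    believes_rumor: bool = False
    neighbors: list = field(default_factory=list)

def build_population(n: int, avg_degree: int, seed: int):
    rng = random.Random(seed)
    agents = [Agent(i, rng.random()) for i in range(n)]
    for a in agents:
        # wire a sparse random social graph
        a.neighbors = rng.sample(range(n), k=avg_degree)
    return agents, rng

def run_rumor_sim(n=500, avg_degree=8, steps=20, seed=42):
    agents, rng = build_population(n, avg_degree, seed)
    agents[0].believes_rumor = True     # the perturbation: seed one rumor
    history = []
    for _ in range(steps):
        # synchronous update: exposure from believing neighbors
        next_state = []
        for a in agents:
            if a.believes_rumor:
                next_state.append(True)
                continue
            exposed = any(agents[j].believes_rumor for j in a.neighbors)
            next_state.append(exposed and rng.random() < a.susceptibility)
        for a, s in zip(agents, next_state):
            a.believes_rumor = s
        history.append(sum(a.believes_rumor for a in agents))
    return history

adoption_curve = run_rumor_sim()        # believers per step, deterministic per seed
```

In a real MiroFish run each agent's decision would be an LLM call conditioned on personality and context; the point of the sketch is only the experiment shape: population, perturbation, measured propagation.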
The intellectual lineage is clear. MiroFish extends a line of work that includes Generative Agents from the 2023 Stanford/Google paper, OASIS itself, and various academic multi-agent frameworks. The goal is understanding — simulating complex adaptive systems at a scale where individual LLM agents produce emergent collective behavior that can be measured, compared against real-world data, and used to predict.
By March 2026, the main repo had crossed 17,000 GitHub stars. A community-maintained fork, nikmcfly/MiroFish-Offline, provides an offline variant that runs without the centralized simulation coordinator — useful for researchers in less-reliable network environments or for air-gapped studies.
OpenClaw Factory, briefly
OpenClaw Factory is the name we use internally for our multi-agent execution pattern. It's not a public open-source framework; it's a production architecture we've built and iterated on across four businesses. I've written about it at length in the 33-agent field report — the short version is that it runs long-lived specialist agents that do recurring operational work: research, draft generation, triage, data enrichment, follow-up. Thirty-three agents, across four businesses, running continuously, each with a narrow scope and a clear output contract.
The design goals are not simulation. They are:
- Reliability — the agent should do its job correctly most of the time, with a clear error mode when it fails.
- Observability — every agent action is logged, tagged, and searchable.
- Bounded scope — each agent has a short job description and cannot exceed it.
- Clear output contract — every agent produces output in a defined shape that downstream processes can consume.
- Cost predictability — each agent run has a predictable token cost, aggregated to a monthly budget.
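The output-contract goal in particular is mechanical enough to sketch. A minimal version, with a hypothetical lead-enrichment agent's contract (the field names are invented for illustration): any run whose output doesn't match the declared shape is rejected before downstream systems see it.

```python
from typing import Any

# Hypothetical output contract for a lead-enrichment agent: every run must
# return exactly these fields with these types, or the run is rejected.
CONTRACT = {
    "lead_id": str,
    "company": str,
    "score": float,
    "notes": str,
}

class ContractViolation(Exception):
    pass

def validate_output(payload: dict[str, Any]) -> dict[str, Any]:
    """Reject any agent output that does not match the declared contract."""
    missing = CONTRACT.keys() - payload.keys()
    extra = payload.keys() - CONTRACT.keys()
    if missing or extra:
        raise ContractViolation(f"missing={sorted(missing)} extra={sorted(extra)}")
    for key, expected in CONTRACT.items():
        if not isinstance(payload[key], expected):
            raise ContractViolation(f"{key}: expected {expected.__name__}")
    return payload
```

The validator sits between the agent and everything downstream, which is what makes failures legible: a rejected run names the exact field that broke the contract.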
These goals are different from MiroFish's. We don't care if the agents produce "emergent behavior" at population scale; we actively do not want that. We want boring, predictable, auditable work from a small number of highly specialized agents.
The axis that actually matters
The frame I use when clients ask about multi-agent architectures is: are you trying to understand a system, or execute work inside one?
If understanding: you're building a simulation. Your design must optimize for interpretable emergent behavior, controlled perturbations, measurable outcomes across populations, and reproducibility of experiments. You want many agents (thousands to millions), each relatively simple, observed collectively.
If execution: you're building a factory. Your design must optimize for reliable task completion, bounded scope, observability, error recovery, and cost per unit of work. You want few agents (ones to dozens), each highly specialized, observed individually.
These are incompatible goals at the architectural level. The simulation wants agents to behave richly and unexpectedly; the factory wants agents to behave predictably and boringly. The simulation wants loose coordination; the factory wants strict orchestration. The simulation wants exploration; the factory wants convergence.
A system designed to do both ends up with agents that are too constrained to produce interesting emergent behavior and too loose to produce reliable execution. The middle is empty.
Where MiroFish excels
MiroFish is excellent at what it's designed for. Several capabilities stand out.
Scale. Running 10,000-agent simulations is the headline feat. Most earlier frameworks (including the original OASIS it builds on) topped out at hundreds of agents before coordination overhead dominated. MiroFish's coordinator architecture handles orders of magnitude more.
Behavioral richness. Each simulated agent has personality parameters, economic endowments, social graph positions, and LLM-driven decision-making. The emergent behavior at the population level — rumor spread, market manipulation, policy response curves — is rich enough that published experiments are producing findings that map onto real-world sociological data.
Reproducibility. MiroFish's experimental protocol and seed management produce reproducible runs, which is the baseline for any scientific use. Run the same experiment with the same seed, get the same result. This is non-trivial in LLM-driven systems and is one of the framework's real contributions.
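The mechanics behind seed management are worth making concrete. A minimal sketch, not MiroFish's implementation: derive a deterministic experiment id from the full config (seed included), and route every stochastic choice through one seeded generator so identical configs yield identical trajectories.

```python
import hashlib
import json
import random

def run_id(config: dict) -> str:
    """Deterministic experiment id from the canonicalized config, seed included."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def simulate(config: dict) -> list[float]:
    rng = random.Random(config["seed"])
    # stand-in for agent decisions: every stochastic choice draws from rng
    return [rng.random() for _ in range(config["n_agents"])]

cfg = {"n_agents": 5, "seed": 7, "scenario": "rumor"}
assert simulate(cfg) == simulate(cfg)            # same seed, same trajectory
assert run_id(cfg) != run_id({**cfg, "seed": 8}) # seed change = new experiment
```

With LLM-backed agents the hard part is pinning the model side (sampling temperature, model version) into that config too; the seeded-RNG discipline above is the easy half.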
Interpretability tooling. Built-in visualizations, trace inspection, and intervention tooling let researchers dissect why an emergent behavior happened. This is the difference between "look, an interesting pattern" and "look, a measurable and explainable phenomenon."
The system is the right tool for policy research, macro-behavioral modeling, and the kind of large-scale social-systems study that has historically been done with agent-based modeling using simpler non-LLM agents.
Where MiroFish is not the right tool
Any production operation. The framework is optimized for running controlled experiments, not for executing work continuously. Cost per agent-hour is high; the error-recovery model is designed for experimental perturbation, not for production failure. Running MiroFish in production to get work done would be like running a physics simulator to calculate your taxes.
Small-N specialist tasks. When the goal is one agent doing one specific thing reliably, MiroFish is wildly over-engineered. The coordination layer adds latency and cost without benefit.
Low-cost recurring work. Each MiroFish agent costs real money per turn because each is LLM-backed. Running thousands of them in simulation is appropriate when the goal justifies the cost; running them to do work that a simple script or a single specialized agent could do is not.
Where OpenClaw Factory excels
The factory pattern, by design, is small-scale. Thirty-three agents across four businesses is not a population; it's a staff. The analogue is a small team of specialist contractors, each with a narrow job, running continuously with clear escalation paths.
Reliability. Because each agent has a narrow scope, error rates are low and error modes are predictable. When something goes wrong, the failure is legible — a specific agent produced output that didn't match the contract, we can trace it and patch.
Cost efficiency. Each agent's monthly cost is known and predictable. A new agent's cost is estimated in advance and compared against the task it's replacing. We don't run agents that don't justify their monthly bill.
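The budget check is simple arithmetic, sketched here with assumed numbers: a blended token price, an estimated token count per run, and the monthly cost of whatever the agent replaces. None of these figures are from our actual books.

```python
PRICE_PER_1K_TOKENS = 0.01          # assumed blended rate, USD (illustrative)

def monthly_cost(tokens_per_run: int, runs_per_month: int) -> float:
    """Projected monthly spend for one agent at a fixed token rate."""
    return tokens_per_run / 1000 * PRICE_PER_1K_TOKENS * runs_per_month

def justifies_itself(tokens_per_run: int, runs_per_month: int,
                     replaced_cost_usd: float) -> bool:
    """An agent earns its slot only if it undercuts the work it replaces."""
    return monthly_cost(tokens_per_run, runs_per_month) < replaced_cost_usd

cost = monthly_cost(8000, 300)      # 8k tokens/run, ~10 runs/day → ~$24/month
```

The useful part is not the arithmetic but the ritual: every new agent gets this estimate before it ships, and the estimate is compared against actuals at the monthly review.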
Observability. Every agent action flows through structured logging. A monthly review surfaces which agents are producing valuable work, which are stale, which are drifting. This is the operational visibility that makes the system trustworthy.
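The structured-logging layer can be sketched in a few lines: one JSON record per agent action, with a fixed core schema plus free-form fields. The field names are illustrative; in production the line would go to a log pipeline rather than stdout.

```python
import json
import time
import uuid

def log_agent_action(agent: str, action: str, status: str, **fields) -> str:
    """Emit one structured, searchable log line per agent action."""
    record = {
        "ts": time.time(),
        "run_id": str(uuid.uuid4()),
        "agent": agent,
        "action": action,
        "status": status,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)                     # stand-in for a real log sink
    return line

line = log_agent_action("lead-scorer", "score_lead", "ok",
                        lead_id="L-1", tokens=412)
```

Because every line carries the same core keys, the monthly review queries are trivial: group by `agent`, count `status != "ok"`, sum `tokens`.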
Integration. Each agent has clear input and output contracts, which means it composes cleanly with other production systems — CRMs, n8n workflows, deployment pipelines, monitoring. It's a citizen of the production environment, not a walled garden.
Where OpenClaw Factory is not the right tool
Understanding emergent behavior. Our agents do not interact with each other in a way that would produce interpretable emergent patterns. They are designed to be independent specialists. If you want to understand how a population of agents behaves collectively, this architecture tells you nothing.
Research experiments. The factory runs production work. It's not set up for controlled experimentation, reproducible runs, or intervention studies. Trying to use it for research would mean re-architecting most of it.
Anything population-scale. Thirty-three agents is the right size for executing a small business's work. Thirty-three thousand agents is not — the coordination model, the cost model, and the observability model all break past a certain size.
The stress-test that clarified this for me
Late 2025, we were refining the WondraKids landing page copy and wanted to stress-test it against different parent personas. The initial impulse was to use something MiroFish-like — simulate a population of 200 parents with varied demographics, feed them the landing page, measure response.
We built a small version of this. Ran 200 synthetic parents through the page. Got back a noisy blob of reactions that averaged out to "generally positive, some confused about the bundle price, a few lukewarm about the copy tone." The result was not actionable. It told us nothing we couldn't have guessed.
The same problem, reframed: run three specialist agents, each instructed to read the copy as a specific skeptical reader type — a budget-conscious parent, a design-skeptical parent, a copy-focused marketer — and return a concrete critique in a structured format. These three ran in parallel, each in isolation, each producing a focused list of objections.
The second approach surfaced six specific issues with the page within fifteen minutes. We fixed four of them. Conversion lifted ~8% on the next A/B test.
The difference between the two approaches is the difference between the two architectures. The first was an attempted simulation — trying to understand what a population of parents feels about the page. The noise overwhelmed the signal. The second was execution — having three focused specialists do a specific review. The signal was immediate.
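The execution version of that review is easy to sketch. Three persona reviewers run in parallel, in isolation, each returning a structured critique; the canned objections below stand in for what would really be an LLM call per persona, and every name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Critique:
    persona: str
    objections: list[str]

# Hypothetical persona set; in the real run each review() call would send
# the persona instructions plus the page copy to a model.
PERSONAS = {
    "budget-conscious parent": ["Bundle price is unclear next to the single price."],
    "design-skeptical parent": ["Hero section does not show the product in use."],
    "copy-focused marketer": ["Headline promises two benefits; pick one."],
}

def review(persona: str, copy_text: str) -> Critique:
    # stand-in for: call the model with the persona prompt + copy_text
    return Critique(persona, PERSONAS[persona])

def run_panel(copy_text: str) -> list[Critique]:
    """Run each specialist reviewer in parallel, each in isolation."""
    with ThreadPoolExecutor(max_workers=len(PERSONAS)) as pool:
        futures = [pool.submit(review, p, copy_text) for p in PERSONAS]
        return [f.result() for f in futures]

critiques = run_panel("WondraKids landing page copy ...")
```

Isolation is the design choice that matters: the reviewers never see each other's output, so three sharp objection lists come back instead of one averaged-out consensus.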
Both are valid approaches in general; which one is right depends on the question. If the question is "how do perceptions of our messaging shift across population demographics under different market conditions," simulation wins. If the question is "what's wrong with this page that I should fix before Thursday," execution wins. Most operators' actual questions are the second kind.
Why the conflation is dangerous
The word "multi-agent" has become a marketing category. Investors fund it as a single thing. Startup pitches use it to describe wildly different architectures. Research papers and production tools get cited in the same sentence.
This conflation produces three bad outcomes.
Operators build the wrong thing. A founder who's read enough multi-agent marketing wants to build "a multi-agent system." They often choose the simulation-ish architecture because it sounds more impressive, then discover six months later that the system doesn't reliably do work. The right architecture for their problem was always the small-N specialist pattern.
Researchers misinterpret production results. Academic researchers writing about "multi-agent performance in production" cite results from what are actually execution-oriented factory patterns. The findings don't generalize to simulation-oriented frameworks, but the paper's phrasing suggests they do.
Vendors build tools that try to do both. A growing category of "multi-agent platform" tools promises to do both simulation and execution. These tools are compromised on both axes. They don't have the scale or reproducibility of a real simulation framework, and they don't have the reliability or observability of a real execution framework. They're in the middle, serving nobody well.
The clarifying question an operator should ask before building: am I trying to understand a system or do work inside one? If the answer is both, the tools are different, and they should probably be different systems.
"I was about to build a ten-agent simulation to forecast my sales pipeline. I ended up building one execution agent that scores each lead. The simulation would have been impressive and useless." — an operator considering multi-agent architecture for their agency
What this means for 2026
The multi-agent landscape in 2026 is bifurcating along this axis, whether the marketing language catches up or not. The visible signals:
Simulation frameworks are getting larger. MiroFish itself, the OASIS work at CAMEL-AI, and various academic groups are pushing population scales upward and adding more sophisticated perturbation/intervention tooling. These are research-grade tools getting better for research-grade use.
Execution frameworks are getting narrower. The factory patterns — OpenClaw and its cousins at other operators — are converging on small-N, highly-specialized, operationally-observable architectures. They're not trying to scale agent counts; they're trying to make each agent more reliable.
Hybrid tools are underperforming both. The platforms that try to do both — let you simulate and deploy multi-agent systems — are getting outflanked by specialists on each side. The ones we've tested are less capable than MiroFish for simulation and less reliable than a handcrafted factory for execution.
My prediction for the next twelve to eighteen months: the word "multi-agent" will split into two vocabularies. Simulation work will use terms like "agent-based modeling," "agent populations," "emergent behavior." Execution work will use terms like "agent factories," "specialist agents," "agent orchestration." The split is already underway in the practitioner community; it hasn't yet landed in the broader industry language.
Reading both projects side by side
For anyone exploring both, a quick orientation:
- Clone 666ghj/MiroFish and run one of the shipped example simulations — the rumor-propagation one is a good first read. Watch the emergent dynamics. This teaches you what a simulation system is for.
- Read the OASIS paper (CAMEL-AI) to understand the intellectual lineage. MiroFish is extending OASIS, not replacing it.
- Watch a Claude Code session where a single specialist agent runs a six-step task end-to-end. Pay attention to the observability, the error recovery, the scope discipline. This teaches you what an execution system is for.
- Compare the two codebases directly. The MiroFish coordinator and agent definitions are elegant; the factory-pattern specialist-agent prompt files are terse and boring. The terseness is not a limitation; it's the design.
A practitioner who works with both will quickly see that the right mental model is not "multi-agent systems" but "simulation systems" versus "execution systems," and the frameworks at each end of the fork are wonderful in different ways.
The final frame
Software categories sometimes get named before they're understood, and the naming sticks long after the underlying reality has split. "Cloud computing" meant twenty things in 2011 and a handful of specific things in 2021. "Multi-agent" is in the 2011 phase right now.
MiroFish is the best open-source multi-agent simulation framework I've watched mature in 2025–26. The Chen Tianqiao investment, the GitHub star count, the active fork ecosystem (including the useful nikmcfly/MiroFish-Offline variant) all point to a project with real staying power in the simulation direction.
OpenClaw Factory is the internal name for a production execution pattern that is boring, predictable, and pays for itself every month. It is not a framework you can clone; it is an architecture you build per-business around your specific operational needs.
Calling both of them "multi-agent" is technically correct and practically misleading. If you are an operator deciding what to build: you are either running experiments or running work. The tools are different. The people who are succeeding in 2026 are the ones who picked the right fork of that road early and built accordingly.