The Multi-Agent Coordination Crisis
Current multi-agent approaches are embarrassingly primitive. Most systems rely on prompt-based coordination: "Agent A, you handle the frontend. Agent B, work on the backend. Try not to break each other's code." This works about as well as you'd expect.
The problems compound quickly. Agents lack true isolation, so one agent's changes can corrupt another's context. Communication happens through unstructured natural language, creating telephone-game distortions. There's no systematic way to handle conflicts or ensure integration quality. The result is often worse than single-agent performance, despite 3-5x higher API costs [1].
The CMU team behind CAID identified the core issue: multi-agent systems need the same coordination primitives that make human software teams functional. The solution isn't better prompts or smarter routing; it's git-native orchestration that mirrors proven development workflows.
CAID: Software Engineering Primitives for AI Teams
CAID works by decomposing complex tasks into dependency graphs, then delegating isolated subtasks to specialist agents who work in parallel git worktrees. Think of it as applying Scrum methodology to AI agents, but with the discipline of version control systems.
Here's how the workflow unfolds:
Task Decomposition: A manager agent analyzes the incoming request and builds a dependency graph of subtasks. Instead of vague natural language instructions, this creates structured, testable work units with clear interfaces and success criteria.
Isolated Delegation: Each subtask gets assigned to an engineer agent working in its own git worktree. This provides true isolation—agents can't accidentally overwrite each other's changes or corrupt shared state. The isolation is hard, not soft—no amount of prompting can substitute for actual filesystem separation [1].
Asynchronous Execution: Agents work in parallel on independent subtasks, with the dependency graph ensuring proper sequencing. This mirrors how human teams handle complex features: frontend and backend developers work simultaneously, then integrate at defined checkpoints.
Self-Verification: Each agent runs tests and validates its own work before submitting. This catches obvious errors early and reduces the integration burden on the manager agent.
Git-Native Integration: The manager agent handles merges using standard git workflows, with conflict resolution and final validation. This leverages decades of tooling and best practices from human software development.
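The five steps above amount to a manager loop: build a dependency graph, release independent subtasks in parallel waves, and integrate sequentially. Here is a minimal sketch in plain Python (the task names, graph, and `schedule` function are illustrative assumptions, not the CAID implementation); each subtask in a wave would run in its own git worktree.

```python
from collections import deque

def schedule(tasks: dict[str, list[str]]) -> list[list[str]]:
    """Group subtasks into parallel waves that respect dependencies.

    `tasks` maps a subtask name to the names it depends on.
    Subtasks in the same wave can be dispatched to engineer agents
    concurrently, each in an isolated git worktree.
    """
    indegree = {t: len(deps) for t, deps in tasks.items()}
    dependents: dict[str, list[str]] = {t: [] for t in tasks}
    for t, deps in tasks.items():
        for d in deps:
            dependents[d].append(t)
    wave = [t for t, n in indegree.items() if n == 0]
    waves = []
    while wave:
        waves.append(sorted(wave))
        next_wave = []
        for t in wave:
            for child in dependents[t]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_wave.append(child)
        wave = next_wave
    return waves

# Hypothetical feature: frontend and backend proceed in parallel once
# the API contract is defined; integration waits on both.
graph = {
    "api_contract": [],
    "frontend": ["api_contract"],
    "backend": ["api_contract"],
    "integration": ["frontend", "backend"],
}
print(schedule(graph))
# [['api_contract'], ['backend', 'frontend'], ['integration']]
```

Each inner list is a wave of subtasks safe to run concurrently (e.g. `git worktree add ../wt-frontend` per subtask); the final integration step is where the manager's sequential git merges happen.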
The Numbers: Why Coordination Beats Compute
The performance gains are striking, especially for complex tasks that require sustained reasoning across multiple domains.
On PaperBench (reproducing machine learning papers from descriptions), CAID delivered massive improvements across model tiers:
- Claude 4.5: 57.2% → 63.3% (+6.1 points, +10.7% relative)
- MiniMax 2.5: 10.4% → 36.7% (+253% relative)
- Weaker models saw the largest gains, with 3.5x relative improvement on average [2]
On Commit0-Lite (implementing features in Python libraries like tinydb), the gains were more modest but consistent:
- Claude 4.5: 53.1% → 59.1% (+6.0 points, +11.3% relative)
- The pattern held across different model families and sizes [1]
The isolation mechanism proved critical. Hard git worktree isolation achieved 63.3% accuracy, while soft instruction-based isolation managed only 55.5%—actually worse than single-agent baselines. This suggests that true coordination requires infrastructure, not just prompting [1].
Costs increased 3-5x due to parallel agent execution, but there's no runtime speedup because git merges happen sequentially. The value proposition is accuracy and reliability, not speed.
Builder's Playbook: Implementing Git-Native Orchestration
For teams ready to implement these patterns, several practical principles emerge from the research and early production deployments.
Start with dependency mapping. Before spinning up multiple agents, invest in task decomposition tooling. JSON-structured communication protocols work better than natural language for coordination. The CMU team found that maintaining an AGENTS.md file documenting common patterns and failure modes improved success rates by 4% [3].
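To make "JSON-structured communication" concrete, here is a sketch of what a structured work unit might look like; every field name here is an illustrative assumption, not the protocol from the CAID paper.

```python
import json

# Hypothetical work-unit message a manager agent might send to an
# engineer agent: explicit dependencies, interface, and success criteria
# instead of free-form natural language.
task = {
    "id": "subtask-07",
    "description": "Add pagination to the /items endpoint",
    "depends_on": ["subtask-03"],
    "interface": {"function": "list_items", "params": ["page", "page_size"]},
    "success_criteria": ["tests/test_pagination.py passes"],
    "worktree": "../wt-subtask-07",
}

message = json.dumps(task)       # what goes over the wire to the agent
received = json.loads(message)   # parses back losslessly
assert received == task          # no telephone-game distortion
```

The point of the structure is that the manager can validate fields mechanically (does every `depends_on` id exist? is there at least one success criterion?) before dispatching, which free-form prose makes impossible.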
Optimize for 3-5 agents maximum. More agents don't linearly improve performance—coordination overhead grows quadratically. Redis research confirms that coordination costs negate parallelism benefits unless carefully managed [5].
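One way to see the quadratic overhead: the number of pairwise communication channels between n agents grows as n(n-1)/2, so each added agent makes every existing agent's coordination burden larger.

```python
def channels(n: int) -> int:
    """Pairwise communication channels among n agents: n choose 2."""
    return n * (n - 1) // 2

for n in (2, 3, 5, 10):
    print(n, channels(n))
# 2 agents imply 1 channel, 5 imply 10, and 10 already imply 45.
```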
Design for verification, not perfection. Each agent should validate its own work with tests and checks before submission. The manager agent can focus on integration rather than debugging individual components.
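A self-verification gate can be as simple as running the test suite inside the agent's worktree and only submitting on a clean exit. The sketch below assumes a pytest-based project; the command, paths, and the commented-out `submit_for_merge` call are all hypothetical.

```python
import subprocess
import sys

def verified(worktree: str) -> bool:
    """Run the test suite inside an agent's worktree.

    Returns True only if the suite exits cleanly; the engineer agent
    should not submit work for merge otherwise.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q"],
        cwd=worktree,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

# if verified("../wt-subtask-07"):
#     submit_for_merge("subtask-07")   # hypothetical manager API
```

Because the gate runs inside the isolated worktree, a failing subtask never reaches the manager's integration queue, which is what keeps the manager focused on merges instead of debugging.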
Use proven tools where possible. Git worktrees provide battle-tested isolation. Existing CI/CD pipelines can handle validation and testing. Tools like Conductor and Claude Code Web are emerging to handle orchestration logistics [3].
Tune the manager-engineer balance. Some teams benefit from engineer agents doing self-verification; others need manager agents to review all work. This depends on task complexity and model capabilities.
Case Study: From Paper to Production
The PaperBench results illustrate why coordination matters more than raw model intelligence. Reproducing ML papers requires sustained reasoning across multiple domains: understanding mathematical concepts, implementing algorithms, debugging code, and validating results.

Single agents, even powerful ones like Claude 4.5, struggle with this complexity. They might nail the algorithm implementation but miss subtle preprocessing steps. Or they'll get the math right but introduce bugs during code translation.
CAID's dependency graph approach breaks paper reproduction into natural subtasks: literature review, mathematical analysis, implementation, testing, and validation. Each agent can focus on its specialty while the manager ensures proper integration.
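The paper-reproduction decomposition described above can be expressed directly as a dependency graph; Python's standard-library `graphlib` then yields a valid execution order. The subtask names are illustrative, not taken from the benchmark.

```python
from graphlib import TopologicalSorter

# Hypothetical paper-reproduction graph: each subtask maps to the set
# of subtasks it depends on.
paper = {
    "literature_review": set(),
    "math_analysis": {"literature_review"},
    "implementation": {"math_analysis"},
    "testing": {"implementation"},
    "validation": {"testing"},
}

order = list(TopologicalSorter(paper).static_order())
print(order)
# ['literature_review', 'math_analysis', 'implementation', 'testing', 'validation']
```

In a real decomposition some of these nodes would themselves fan out into parallel subtasks (e.g. per-algorithm implementation), which is where the manager's wave scheduling and worktree isolation earn their keep.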
The jump from 57.2% to 63.3% on Claude 4.5 represents the difference between "sometimes works" and "reliably works" for complex software engineering tasks. That's the threshold where AI coding assistance becomes genuinely useful for production teams.
The Orchestration Stack: What's Next
The CAID research points toward a broader shift in AI development infrastructure. Just as human software teams evolved from cowboy coding to modern DevOps practices, AI agent teams need systematic coordination primitives.
AgentOrchestra and similar frameworks are emerging to provide hierarchical task execution and adaptive planning [6]. These tools treat orchestration as a first-class engineering discipline, not an afterthought.
The next wave will likely include:
- Standardized agent communication protocols (beyond JSON)
- Dependency-aware task scheduling with automatic parallelization
- Integration testing frameworks designed for multi-agent workflows
- Observability tools for debugging coordination failures
Nordic companies, with their strong software engineering culture, are well-positioned to lead this transition. The emphasis on systematic processes and quality engineering aligns naturally with git-native orchestration principles.
When AI Builds the Software: The Coordination Revolution
The deeper insight from CAID research isn't just about better multi-agent systems—it's about what changes when AI becomes the primary software builder rather than a coding assistant.
Human development teams evolved sophisticated coordination mechanisms because software complexity demands it. As AI agents take on larger portions of the development lifecycle, they'll need equally sophisticated coordination. Code generation is becoming commoditized; orchestration is the new differentiator.
This shift has profound implications for how we think about AI development tools. Instead of optimizing for single-agent performance on isolated tasks, we need infrastructure that enables reliable collaboration between specialized AI systems.
The teams that master this coordination layer first will build AI products that are qualitatively different—not just faster or cheaper, but capable of sustained reasoning across complex domains that single agents can't handle.
CAID proves that the path forward isn't more powerful models or clever prompts. It's applying proven engineering discipline to AI systems. Code is free. Coordination isn't.
Sources
1. https://arxiv.org/abs/2603.21489
2. https://bemiagent.com/agents/caid-what-cmu-learned-about-making-multiple-agents-code-together-without-breaking-everything
3. https://addyosmani.com/blog/code-agent-orchestra
4. https://www.linkedin.com/posts/omarsar_new-research-from-cmu-bookmark-this-one-activity-7444393288272379905-enu3
5. https://redis.io/blog/multi-agent-systems-coordinated-ai
6. https://arxiv.org/abs/2506.12508
7. https://medium.com/design-bootcamp/running-multiple-ai-agents-at-once-using-git-worktrees-57759e001d7a
Want to go deeper?
We explore the frontier of AI-built software by actually building it. See what we're working on.