Everything you need to know about SET

Autonomous multi-agent orchestration for Claude Code. Give it a spec, get merged features. Here's how and why.

200+ E2E runs · 376 capability specs · 720+ agent hours

General

What is SET?

SET is an orchestration framework that transforms a product specification into fully implemented, tested, and merged code — autonomously.

You write a detailed markdown spec (data model, pages, design tokens, auth flows, seed data). SET decomposes it into independent changes, dispatches parallel Claude Code agents into isolated git worktrees, runs deterministic quality gates, and merges the results.

Spec (markdown input) → Decompose (DAG of changes) → Dispatch (parallel agents) → Monitor (15s poll cycle) → Verify (quality gates) → Merge (conflict resolution) → Replan (cover gaps)

Who is SET for?

Development teams and technical leaders who already use Claude Code and want to scale beyond single-agent, single-task workflows. If you're an architect managing a backlog of well-specified features, or a CTO evaluating how AI agents can own the full implementation-to-merge cycle, SET is the layer that makes that possible.

It assumes you can write a good spec — it handles everything after that.

Is this production-ready?

SET was built with itself over 79 days. These are the numbers:

  • 1,500+ commits
  • 376 capability specs
  • 720+ agent hours
  • 200+ E2E runs

The MiniShop benchmark delivers 6/6 changes merged, zero human intervention, in 1h 45m. CraftBrew (15 changes, 150+ files, 28 DB models) completed fully autonomously in ~6h.

How Is SET Different From...

...just using Claude Code?

Claude Code in 2026 is dramatically capable: native worktrees, Agent Teams (experimental, 3-5 agents), 26 hook events with 4 handler types, auto-memory, Plan mode, subagents with worktree isolation, Agent SDK, and MCP. SET is built on top of all these primitives.

| | Claude Code (alone) | SET |
|---|---|---|
| Scope | One task or one team session | Full spec → decomposed into N parallel changes |
| Planning | Plan mode (freeform, ephemeral, read-only) | OpenSpec: persistent versioned artifacts with traceability |
| Quality | Hooks can run checks (DIY wiring) | Structured gate pipeline: build → test → E2E → review |
| Merging | Manual git merge, no enforcement | Automated merge queue + conflict resolution + post-merge verification |
| Recovery | Session dies, restart manually | Sentinel detects crash in 30s, graduated escalation |
| Memory | Auto-memory (flat file at startup) | Semantic memory graph with topic recall at 4 lifecycle points |
| State | Session-scoped, lost on restart | Atomic JSON, resumable across restarts |
| Coordination | Agent Teams: one session, shared task list | Cross-session, cross-machine orchestration with sentinel |

Claude Code gives you excellent building blocks. SET gives you the assembled machine. You could build this yourself with the Agent SDK and hooks — SET is the battle-tested implementation.

What SET doesn't have: Claude Code's 101+ plugin marketplace, Agent SDK for custom development, deep IDE integration (VS Code, JetBrains).

...Claude Code Agent Teams?

Agent Teams = parallelism within one session. A lead assigns subtasks to teammates (3-5 recommended); they share context via task list + mailbox. Important: teammates share the working directory by default — two teammates editing the same file leads to overwrites. Worktree isolation available via subagent config but not automatic. Still experimental: no session resumption, task status lag, one team per session, lead is fixed.

SET = parallelism across sessions, machines, and time. A planner decomposes a full spec into a dependency DAG, dispatches each to its own long-running agent, manages quality gates and the merge pipeline. Good for shipping an entire product.

They're complementary. SET can use Agent Teams inside each worktree while managing cross-change orchestration externally.

What Agent Teams does better: Zero-setup parallelism. No framework installation needed — one environment variable starts a team. For a quick parallel task within a single feature, Agent Teams is faster to reach for.

...Cursor's parallel agents?

Cursor 3 (April 2026) has two parallelism modes: local worktree agents (up to 8, via git worktree add) and cloud Background Agents (no cap, credit-bound, each in an isolated AWS Ubuntu VM). You can close your laptop with cloud agents.

What Cursor lacks vs SET:

  • No spec decomposition — agents launched from ad-hoc prompts, no dependency ordering
  • ~30% PR merge rate — by Cursor's own published stat, roughly 30% of generated PRs pass CI and merge without intervention. SET MiniShop: 100%.
  • No inter-agent coordination — multiple agents have no awareness of each other
  • No supervision — no sentinel, no crash recovery, no stall detection

What Cursor does better: Cloud execution (agents work while you sleep), local worktree agents with zero framework setup, polished IDE, multi-model support, CI auto-fix cookbook. Cost caveat: cloud agents ~$5-15 per PR, users report $2000+ in two days with heavy use.

...Devin?

Devin is an autonomous AI engineer — takes a task, works in a sandboxed VM, creates PRs. Can run multiple concurrent sessions (each in its own VM), but sessions are independent with no coordination between them.

| | Devin | SET |
|---|---|---|
| Execution | Cloud VM sandbox | Local worktrees |
| Parallelism | Independent sessions (no coordination) | Coordinated parallel via orchestrator + merge queue |
| Testing | Runs tests if they exist (ad-hoc) | Structured gate pipeline (build → test → E2E → review) |
| Integrations | Excellent Slack/Jira/GitHub | CLI + web dashboard + MCP |
| Merge | Opens PR, relies on CI | Integration gates enforced before merge |

What Devin does better: Slack integration is best-in-class — assign a task from Slack, get a PR back. Cloud VM means zero local setup. The UI for watching agent work is polished. For simple, independent tasks (migrations, CRUD, test writing), Devin is smoother than setting up SET orchestration.

What SET does better: Multi-change coordination, spec traceability, pre-merge quality gates, sentinel supervision, persistent memory, and deterministic merge ordering.

...Kiro (Amazon)?

Kiro (GA Nov 2025) is the closest philosophical match: a spec-driven IDE with formal EARS requirements, design docs, and task lists. Built on VS Code, powered by Bedrock. Supports Claude, DeepSeek, Qwen, MiniMax via auto-router.

Kiro's genuine innovations:

  • EARS spec notation — formal SHALL statements with Requirements-First or Design-First entry points. Specs stay synced with code.
  • Property-Based Testing — extracts testable properties from specs, generates hundreds of random inputs, shrinks to find minimal failing cases. Auto-fixes. Genuinely novel.
  • 10 hook trigger types — File Create/Save/Delete, Prompt Submit, Agent Stop, Pre/Post Tool Use, Pre/Post Task Execution, Manual.
  • Autonomous Agent (preview) — background agent with 3 sub-agents (planner, writer, verifier). Up to 10 concurrent tasks across repos. Opens PRs, never merges. Learns from code review.
  • Multi-model — Claude, DeepSeek, Qwen, auto-router. Pricing: Free→$20→$40→$200/mo.

The differences:

  • Kiro's Autonomous Agent handles 10 tasks but opens PRs without merging. SET manages the full merge pipeline with gates.
  • Kiro has PBT (random test generation from spec properties). SET has deterministic gates (exit codes).
  • Kiro is multi-model. SET is Claude-only.

What Kiro does better: PBT is genuinely novel, Autonomous Agent handles 10 concurrent tasks, multi-model support, lower barrier to entry, 10 hook trigger types. What SET does better: Spec decomposition into DAGs, coordinated merge with gates, sentinel supervision, semantic memory, design integration.

...Augment Intent?

Augment Intent (public beta, Feb 2026, macOS only) is architecturally the most similar tool to SET:

  • Living Specifications — self-maintaining spec docs that auto-update as agents work. Changes propagate to active agents.
  • Coordinator/Specialist/Verifier — 6 specialist personas (Investigate, Implement, Verify, Critique, Debug, Code Review).
  • Git worktree isolation — each task creates a "Space" with its own branch and worktree.
  • Multi-model — runs Claude Code, Codex, OpenCode. Mix models per task.
  • No agent cap — "Run as many agents as the task needs."

What Augment does better: Multi-model mixing (Opus for planning, Sonnet for coding), living specs that auto-update, specialist agent personas, polished desktop UX.

What SET does better: Deterministic gates (exit codes, not agent judgment), proven production track record (200+ runs), Linux support, web dashboard, design integration, persistent semantic memory, full merge pipeline. Augment is macOS-only beta; SET is battle-tested in production.

...Roo Code, Aider, Cline, Windsurf?

All excellent single-agent tools — each with genuine strengths SET lacks:

  • Roo Code — Configurable modes (Architect/Code/Debug/custom), sequential delegation ("Boomerang" pattern). Model-agnostic. Better at: easy custom mode creation, any-LLM support, open source community.
  • Aider — CLI pair programmer. Better at: best-in-class git integration (auto-commit with meaningful messages, full undo), any-model support, repo map (tree-sitter), cost-efficient token usage, edit format innovation.
  • Cline — VS Code extension. Better at: best-in-class MCP marketplace, full transparency (every tool call visible), any-model support, granular approval workflow.
  • Windsurf — AI IDE (acquired by OpenAI ~$3B). Cascade engine had strong within-session context tracking. Current status post-acquisition uncertain.

SET is not a better version of these tools — it's a different category. These are the developer's hands. SET is the sprint board, CI pipeline, and release manager.

...Copilot Coding Agent, OpenHands, Composio?

  • GitHub Copilot Coding Agent — Assign a GitHub Issue, Copilot creates a branch, codes, runs CI, self-reviews, opens a PR. Cloud-hosted. Better at: zero-setup for GitHub users, largest distribution, GitHub-native workflow. Lacks: no multi-agent coordination (each agent independent), no spec decomposition, no pre-merge gate pipeline.
  • OpenHands — Strongest open-source single-agent runtime. Docker-sandboxed, multi-model, strong SWE-bench results (50%+). Provides agent execution, not orchestration workflow. No parallel coordination, no gates, no merge pipeline.
  • Composio — Correction: Composio is a tool-integration platform (250+ API integrations for agents), NOT an agent orchestrator. It provides middleware for CrewAI, LangGraph, etc. to call external tools. Different category from SET.
  • GPT-Engineer / Lovable — App builders for non-developers. Prompt to MVP. Different category entirely.

Capability matrix: SET vs. the landscape

| Tool | Parallel agents | Isolation | Specs | Gates | Merge pipeline | Supervisor | Cloud | Any LLM |
|---|---|---|---|---|---|---|---|---|
| SET | | Worktrees | OpenSpec | 9 gates | | Sentinel | | Claude |
| Augment Intent | Coordinator | Spaces | Living specs | Verifier agent | | | N/A | |
| Claude Code | Experimental | Subagents | | Hooks (DIY) | | | | Claude |
| Cursor | 8 local + cloud | WT + VMs | | | ~30% merge | | | Multi |
| Devin | Independent | Sandbox VM | | Ad-hoc tests | | | | Proprietary |
| Kiro | 10 tasks (preview) | | EARS + PBT | PBT + hooks | Opens PRs | | | Auto-router |
| Copilot Agent | Independent | Cloud VM | | CI + self-review | | | | GPT/Claude |
| Roo Code | Modes | | | | | | | |
| Aider | | | | | | | | |
| Cline | | | | | | | | |
| OpenHands | | Docker | | | | | | |

SET's unique position: the combination of structured specs + deterministic gates + merge pipeline + sentinel. Other tools excel where SET doesn't: cloud execution (Cursor, Devin, Copilot), model flexibility (Aider, Kiro, Cline), living specs (Augment Intent), PBT (Kiro), IDE integration (Kiro, Cursor).

OpenSpec

What is OpenSpec and why not just use a prompt?

OpenSpec is a structured, artifact-driven methodology. Instead of a conversation, work is expressed as a sequence of structured documents that serve as contracts between planner, implementer, and verifier:

  1. Proposal — Why we're doing this (problem, impact)
  2. Specs — What exactly must be built (WHEN/THEN acceptance criteria)
  3. Design — How we'll build it (decisions, tradeoffs)
  4. Tasks — Implementation checklist ([REQ: requirement-name] traceability)

Why not just a prompt?

  • Prompts drift. Agents interpret, improvise, skip. Specs have explicit IN SCOPE / OUT OF SCOPE.
  • Prompts can't be verified. How do you check "build a webshop"? OpenSpec checks every requirement against tasks against code.
  • Prompts don't compose. 5 parallel agents need divided scope. Delta specs assign specific requirements to specific changes.
  • Prompts leave no record. OpenSpec archives the full decision chain for future reference.

How is this different from Claude Code's Plan mode?

Plan mode is a thinking step. OpenSpec is a workflow system.

| | Plan Mode | OpenSpec |
|---|---|---|
| Output | Freeform text | Structured artifacts (proposal, specs, design, tasks) |
| Persistence | Disappears after session | Committed to repo, archived after completion |
| Traceability | None | Every task traces to a requirement |
| Verification | None | Automated: completeness, correctness, coherence |
| Scope | Trust | Explicit IN SCOPE / OUT OF SCOPE |
| Multi-agent | Not designed for it | Delta specs assign scoped work to each agent |

Plan mode helps a single agent think. OpenSpec gives a system of agents structured contracts to work against and verify.

What are delta specs?

When a change is created (e.g., add-user-auth), its spec files are delta specifications — the incremental requirements this change introduces, using ADDED / MODIFIED / REMOVED markers.

After merge, delta specs sync into main specs — the single source of truth. This means:

  • Each change only describes what it changes, not the entire system
  • Multiple changes can touch the same capability without conflicting
  • Main specs evolve incrementally as changes merge
  • Full history preserved in archived changes

What does the artifact workflow look like?

Explore (think, read-only) → Proposal (why) → Specs (what: WHEN/THEN) → Design (how) → Tasks (checklist) → Apply (implement) → Verify (check) → Archive (preserve)

Each artifact depends on the previous. The schema enforces ordering — you can't create tasks before design, because design decisions inform task structure.
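A minimal sketch of that ordering rule. The artifact names come from the workflow above; the function and its shape are illustrative, not SET's actual schema:

```python
ARTIFACT_ORDER = ["proposal", "specs", "design", "tasks"]

def can_create(artifact, existing):
    """An artifact may only be created once all earlier ones exist."""
    idx = ARTIFACT_ORDER.index(artifact)
    return all(a in existing for a in ARTIFACT_ORDER[:idx])

print(can_create("tasks", {"proposal", "specs"}))   # → False (design missing)
print(can_create("design", {"proposal", "specs"}))  # → True
```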

Fast-track: /opsx:ff generates all artifacts in one pass.

Orchestration

How does parallel execution actually work?

  1. Decompose — Planner reads your spec, creates a dependency DAG of independent changes
  2. Dispatch — For each ready change: create worktree, generate context, bootstrap env, start Ralph Loop
  3. Monitor — Every 15 seconds: check progress, detect stalls, track budgets
  4. Verify — Agent reports "done" → run gate pipeline (build → test → E2E → review)
  5. Merge — Sequential merge queue with conflict resolution and post-merge verification
  6. Sync — After each merge, all running worktrees pull main immediately
  7. Replan — After all changes merge, check for uncovered requirements, generate new changes
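The decompose-and-dispatch steps above can be sketched as a ready-set scheduler over the dependency DAG. The change names and the dict shape are illustrative, not SET's API:

```python
def schedule(changes):
    """changes: {name: set of dependency names} -> list of batches.

    Each batch is a set of changes whose dependencies have already
    merged, so its members can run as parallel agents."""
    merged, batches, pending = set(), [], dict(changes)
    while pending:
        ready = {c for c, deps in pending.items() if deps <= merged}
        if not ready:
            raise ValueError("dependency cycle in spec decomposition")
        batches.append(ready)
        merged |= ready
        for c in ready:
            del pending[c]
    return batches

# e.g. auth must merge before profile and orders; cart is independent
print(schedule({
    "add-user-auth": set(),
    "add-cart": set(),
    "add-profile": {"add-user-auth"},
    "add-orders": {"add-user-auth", "add-cart"},
}))
```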

Why git worktrees?

True filesystem isolation without the overhead of cloning:

  • Each agent has its own working directory — no file conflicts during parallel development
  • Each agent has its own branch — clean, independent git history
  • Worktrees share the same .git directory — no disk waste from full clones
  • Independent dep installs, test runs, and builds — no interference
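A sketch of the per-change setup. The branch and path naming convention here is hypothetical; the `git worktree add` command itself is standard git:

```python
def worktree_commands(repo, change):
    """Build the commands that give one change its own isolated worktree."""
    branch = f"change/{change}"
    path = f"{repo}/.worktrees/{change}"
    return [
        # new branch + isolated working directory, shared .git store
        ["git", "-C", repo, "worktree", "add", "-b", branch, path, "main"],
        # runs inside the worktree, invisible to the other agents
        ["git", "-C", path, "status", "--short"],
    ]

for cmd in worktree_commands("/repo/my-app", "add-user-auth"):
    print(" ".join(cmd))
```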

This is fundamentally different from agents "coordinating" via messages in a shared workspace — that approach breaks down when agents edit the same files simultaneously.

What happens when agents conflict?

Multi-layer conflict resolution:

  1. Preventive — Dependency DAG orders cross-cutting changes sequentially. Profile-defined cross-cutting files are serialized.
  2. Generated files — Lockfiles, build artifacts auto-resolved, then regenerated (pnpm install).
  3. Real conflicts — Source code conflicts cause merge-blocked. Sentinel investigates, redispatches, or escalates.
  4. Post-merge sync — All running worktrees pull main immediately after every merge.
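Layer 2 can be sketched as a classifier that separates auto-resolvable generated files from real source conflicts. The lockfile list is illustrative, not SET's full set:

```python
# Filenames treated as regenerable rather than hand-merged.
GENERATED = {"pnpm-lock.yaml", "package-lock.json", "yarn.lock"}

def split_conflicts(paths):
    """Partition conflicted paths: generated files get auto-resolved and
    regenerated (e.g. via pnpm install); the rest block the merge."""
    generated = [p for p in paths if p.rsplit("/", 1)[-1] in GENERATED]
    real = [p for p in paths if p not in generated]
    return generated, real

gen, real = split_conflicts(["pnpm-lock.yaml", "src/cart.ts"])
print(gen, real)  # → ['pnpm-lock.yaml'] ['src/cart.ts']
```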

In practice: CraftBrew (15 changes) had 4 conflicts — all auto-resolved. MiniShop (6 changes): zero conflicts.

What is the sentinel?

An AI supervisor that watches orchestration and handles what goes wrong. Separate agent from the orchestrator — supervisor/subordinate pattern.

| Event | Sentinel action |
|---|---|
| Agent crash | Diagnose from logs, restart or escalate |
| Agent stall (>120s) | Investigate cause, attempt recovery |
| Periodic checkpoint | Auto-approve (routine) or escalate (unusual) |
| Orchestration complete | Generate summary report |
| Budget overrun | Pause agent, escalate |
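The graduated escalation applied to repeated stalls can be sketched as a ladder. The function and thresholds are illustrative; only the warn → restart → redispatch → fail ordering comes from SET:

```python
LADDER = ["warn", "restart", "redispatch", "fail"]

def escalate(stall_count):
    """Each successive stall of the same agent climbs one rung."""
    return LADDER[min(stall_count, len(LADDER)) - 1]

print([escalate(n) for n in (1, 2, 3, 4, 5)])
# → ['warn', 'restart', 'redispatch', 'fail', 'fail']
```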

Cost: typically 5-10 LLM calls per entire run. Saves hours of wasted compute by catching crashes that would otherwise silently waste an overnight run.

Quality & Verification

What are integration gates?

Deterministic quality checks before merging. Exit codes, not LLM judgment.

| Gate | What | How |
|---|---|---|
| build | Types check, code compiles | tsc --noEmit, next build |
| test | Unit/integration tests | vitest run, pytest |
| e2e | Browser tests | playwright test |
| scope_check | Files match scope | Changed files validated against declared scope |
| test_files | Tests present | Test files exist for implemented code |
| review | Code quality, security | Claude review, no CRITICAL findings |
| rules | Custom compliance | Profile-defined rules (naming, patterns) |
| spec_verify | Requirements addressed | All REQ-IDs have tasks |
| smoke | Post-merge sanity | Custom command (runs after merge) |

If a gate fails, the agent receives the error and retries. Self-healing. No human needed.
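A deterministic gate reduces to "run a command, trust only the exit code." A minimal sketch; the runner is illustrative, not SET's implementation:

```python
import subprocess
import sys

def run_gate(name, cmd):
    """Run one gate command; pass/fail is the exit code, nothing else."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    passed = proc.returncode == 0   # deterministic: exit code, not LLM judgment
    return passed, proc.stderr      # stderr is what the agent sees on retry

# Stand-ins for `next build` / `vitest run` so the sketch is self-contained.
ok, _ = run_gate("build", [sys.executable, "-c", "print('compiled')"])
failed, err = run_gate("test", [sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)  # → True False
```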

Why not just trust the LLM's judgment?

Because LLMs hallucinate confidence. "Looks good to me" from a code review is not the same as vitest run returning exit code 0.

MiniShop's 5 gate retries — all self-healed from real bugs an LLM review would have missed:

  1. Missing test file → test gate caught it
  2. Jest config import error → build gate caught it
  3. Playwright auth test failures ×3 → agent fixed to match actual behavior
  4. Post-merge type mismatch → agent synced main
  5. Cart test race condition → agent added waitForSelector

An LLM review would have said "looks good" for at least 3 of these.

How do you measure output quality across runs?

Structural convergence. Run the same spec twice independently and measure similarity:

  • 83/100 MiniShop convergence score
  • 83-87% range across projects
  • 100% schema equivalence
  • 100% convention compliance

Remaining divergence is stylistic (variable naming, CSS order), not structural. The spec + template system produces deterministic architecture even with non-deterministic LLMs.
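As a toy illustration of a convergence score, compare the file manifests of two independent runs; set-compare's real metric is richer than this:

```python
import difflib

def convergence(files_a, files_b):
    """0-100 similarity of two runs' sorted file manifests."""
    sm = difflib.SequenceMatcher(a=sorted(files_a), b=sorted(files_b))
    return round(100 * sm.ratio())

run1 = ["app/cart/page.tsx", "lib/db.ts", "prisma/schema.prisma"]
run2 = run1 + ["lib/utils.ts"]   # second run emitted one extra helper
print(convergence(run1, run1))   # → 100
print(convergence(run1, run2))   # → 86
```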


Memory & Learning

How does persistent memory work?

Hook-driven memory (shodh-memory) captures and injects context automatically. Agents don't need to save explicitly.

| Hook | When | What |
|---|---|---|
| Warmstart | Session start | Loads relevant memories as context |
| Pre-tool | Before each tool call | Injects topic-based recall |
| Post-tool | After Read/Bash | Surfaces past experience |
| Save | Session end | Extracts new insights from conversation |
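Topic-based recall at the pre-tool hook can be sketched as a keyword match over stored notes. The store and the memories here are illustrative; shodh-memory's actual scoring is more involved:

```python
# (topic, note) pairs as a stand-in for the memory store.
MEMORIES = [
    ("prisma", "Run the Prisma generate step after schema edits."),
    ("playwright", "Wait for selectors; cart tests raced without it."),
    ("auth", "Protected routes go through middleware checks."),
]

def recall(tool_input, limit=2):
    """Surface memories whose topic appears in the upcoming tool call."""
    text = tool_input.lower()
    return [note for topic, note in MEMORIES if topic in text][:limit]

print(recall("npx playwright test e2e/cart.spec.ts"))
# → ['Wait for selectors; cart tests raced without it.']
```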

Key finding: zero voluntary saves across 15+ sessions. Agents don't save on their own — the hook infrastructure is essential.

Why does memory matter for orchestration?

Without memory, every agent rediscovers conventions, repeats mistakes, wastes tokens.

+34% convention compliance (CraftBazaar)

Learnings from failed runs convert to rules, enforced in the next run. The system improves with every orchestration.

Architecture & Extensibility

What is the plugin system?

Three layers separate concerns:

  • Layer 1 — Core (lib/set_orch/): Abstract orchestration. Dispatcher, monitor, merger, gates. No project-specific logic.
  • Layer 2 — Modules (modules/): Project-type knowledge. modules/web/ knows Next.js, Playwright, Prisma.
  • Layer 3 — External: Your own plugins via pip install + entry_points. set-project-fintech could add IDOR scanning, PCI compliance.

Each module implements the ProjectType ABC: test detection, forbidden patterns, verification rules, custom gates, merge strategies, planning rules. New project types don't touch core — they extend it.
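A sketch of that split, with illustrative method names; the real interface in lib/set_orch/ is larger:

```python
from abc import ABC, abstractmethod

class ProjectType(ABC):
    """Core (Layer 1) defines the contract; it knows no frameworks."""
    @abstractmethod
    def test_command(self) -> list:
        ...
    @abstractmethod
    def forbidden_patterns(self) -> list:
        ...

class WebProject(ProjectType):
    """A Layer 2 module contributes the framework-specific knowledge."""
    def test_command(self):
        return ["pnpm", "vitest", "run"]
    def forbidden_patterns(self):
        return ["console.log(", ": any"]

print(WebProject().test_command())  # → ['pnpm', 'vitest', 'run']
```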

Can I use this without Claude Code?

No. SET is built specifically for Claude Code: worktrees, hooks, MCP, skills, subagents. This is by design — SET doesn't abstract over LLMs. It leverages Claude's strengths fully: 200K+ context, native tool use, code understanding.

Depth beats breadth. Abstracting to a lowest-common-denominator API would sacrifice these capabilities for theoretical portability.

Can this run on-premise?

The infrastructure is designed for it. SET is self-hosted — no SaaS dependency. The orchestration engine, gates, and state management have no cloud dependency. Only the LLM endpoint needs configuration.

When on-premise Claude models become available for regulated industries (banks, defense, government), SET's architecture works unchanged.

How does design system integration work?

  1. Export from design tool → design-system.md (tokens) + design-brief.md (visual specs)
  2. Dispatcher scope-matches relevant pages to each change → per-change design.md
  3. Agent receives exact hex colors, font names, component layouts
  4. Review gate checks design compliance — token mismatches flagged

Eliminates the "shadcn defaults everywhere" problem. Agents implement your brand, not a generic component library.


What SET Doesn't Do (Yet)

Honest gaps where competitors are ahead

| Gap | Who does it better | Notes |
|---|---|---|
| Cloud execution | Cursor BGA, Devin, Copilot | SET requires a running local machine. Cloud agents work while you sleep. |
| Model flexibility | Aider, Roo Code, Cline | SET is Claude-only. No GPT, Gemini, or local model support. |
| IDE integration | Kiro, Cursor, Windsurf | SET is CLI + web dashboard. No VS Code/JetBrains plugin. |
| Zero-setup | Copilot, Cline, Cursor | SET requires pip install, project init, config. Others are install-and-go. |
| Issue tracker → PR | Copilot Coding Agent | SET works from specs, not from Jira/Linear/GitHub Issues. |
| Slack trigger | Devin | Can't trigger SET from Slack. |
| File-event hooks | Kiro | SET hooks are at orchestration level, not IDE file-save events. |
| MCP marketplace | Cline | SET has a custom MCP server, not a marketplace for third-party tools. |
| Quick prototyping | Lovable, Cursor, Claude Code | SET's spec-driven workflow adds upfront overhead. For a quick prototype, Claude Code alone is faster. |
| Spec writing | | The spec is a bottleneck: orchestration quality is bounded by spec quality. Writing a good spec takes effort. |

These are conscious trade-offs, not oversights. SET optimizes for orchestration depth over integration breadth. The overhead cost is real — SET is not for quick prototypes. It's for when you already know what to build and want deterministic, reproducible implementation.

Practical

What does a spec need to contain?

Your spec is the single most important input. Required:

  1. Project overview — What, who, tech stack
  2. Data model — Entities, fields, relationships
  3. Page layouts — Sections, columns, components
  4. Component behavior — Click, hover, state changes
  5. Auth & roles — Permissions, protected routes
  6. Seed data — Realistic initial data
  7. Design tokens — Brand colors (hex), fonts, spacing
  8. E2E test expectations — Critical flows

Each requirement needs a REQ-ID (REQ-AUTH-01) and at least one WHEN/THEN scenario.
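A toy checker in the spirit of the spec_verify gate, flagging REQ-IDs that lack a WHEN/THEN scenario. The spec fragment and the parser are illustrative:

```python
import re

SPEC = """\
### REQ-AUTH-01
WHEN a visitor submits valid credentials
THEN a session cookie is set and they land on /dashboard

### REQ-AUTH-02
Login failures should be logged.
"""

def missing_scenarios(spec):
    """Return REQ-IDs whose block lacks a WHEN/THEN scenario."""
    blocks = re.split(r"^### ", spec, flags=re.M)[1:]
    return [b.splitlines()[0] for b in blocks
            if "WHEN" not in b or "THEN" not in b]

print(missing_scenarios(SPEC))  # → ['REQ-AUTH-02']
```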

How long does an orchestration take?

| Project | Changes | Wall time | Tokens | Interventions |
|---|---|---|---|---|
| Micro-Web (simple) | 3-4 | ~45m | ~1M | 0 |
| MiniShop (e-commerce) | 6 | 1h 45m | 2.7M | 0 |
| CraftBrew (complex) | 15 | ~6h | ~11M | 0 |

Token scaling is super-linear (4x tokens for 2.5x changes) because later changes require more context from merged code.

What are the self-healing capabilities?

Gate-level:

  • Test failure → agent reads error, fixes, reruns
  • Build error → agent reads type error, fixes it
  • E2E failure → agent sees Playwright trace, updates selectors
  • Type mismatch → agent syncs main, resolves

Sentinel-level:

  • Agent crash → detected in 30s, auto-restart
  • Agent stall → watchdog escalates: warn → restart → redispatch → fail
  • Orphaned worktree → cleaned up on restart

How do I get started?

# Install SET
pip install -e .
pip install -e modules/web

# Initialize a project
set-project init --name my-app --project-type web --template nextjs

# Write your spec (docs/spec.md)

# Start orchestration
curl -X POST http://localhost:7400/api/my-app/sentinel/start \
  -H 'Content-Type: application/json' \
  -d '{"spec":"docs/spec.md"}'

Or step-by-step:

/opsx:explore   → Think through the problem
/opsx:ff change → Generate all artifacts
/opsx:apply     → Implement
/opsx:verify    → Check
/opsx:archive   → Done

The Big Picture

What problem does SET actually solve?

The gap between "AI can write code" and "AI can ship software."

Writing code is 20% of the work. The other 80%: decomposing requirements, coordinating parallel work, handling conflicts, running quality checks, managing merge order, recovering from failures, learning from mistakes.

SET automates the 80%.

Why specs instead of prompts?

"Build a webshop" produces a different webshop every time.

"Build a webshop with these 28 data models, these 12 pages, these design tokens, these auth rules, these seed data records, and these E2E scenarios" produces the same webshop every time.

The spec is the determinism layer. MiniShop: 83/100 structural convergence score across independent runs (measured by set-compare). Without specs, convergence approaches 0%.

How is this different from just running CI/CD?

CI/CD validates code after someone creates a PR. SET manages the entire pipeline before the PR exists:

| CI/CD | SET |
|---|---|
| PR created → tests → review → merge | Spec → decompose → dispatch → gates → merge → replan |

CI/CD assumes someone creates the PR. SET creates the PRs, validates them, merges them, and identifies what's still missing.

Why not abstract over multiple LLMs?

Because depth beats breadth. SET leverages Claude-specific capabilities: 200K+ context, native tool use, worktree support, hooks, MCP. Abstracting to a lowest-common-denominator API would sacrifice these for theoretical portability.

SET bets on Claude getting better — and compounds that bet.

What's the competitive moat?

The combination. No other tool provides all six:

  1. Structured Specs: traceable requirements with WHEN/THEN scenarios, not prompts
  2. Parallel Agents: isolated worktrees, dedicated agents, across machines
  3. Quality Gates: deterministic (exit codes, not vibes). Build, test, E2E, review.
  4. Merge Pipeline: automated conflict resolution, post-merge verification
  5. Sentinel Supervision: crash recovery in 30s, stall detection, budget tracking
  6. Persistent Memory: cross-session learning, convention compliance, continuous improvement

Most tools have 1-2 of these. Closest competitors have 2-3. The value is in the integration — the six capabilities reinforce each other. Structured specs enable meaningful gates. Gates enable autonomous merging. Memory enables learning. The sentinel enables unattended operation.