Everything you need to know about SET

Autonomous multi-agent orchestration for Claude Code. Give it a spec, get merged features. Here's how and why.

200+ E2E runs · 376 capability specs · 720+ agent hours

General

What is SET?

SET is an orchestration framework that transforms a product specification into fully implemented, tested, and merged code — autonomously.

You write a detailed markdown spec (data model, pages, design tokens, auth flows, seed data). SET decomposes it into independent changes, dispatches parallel Claude Code agents into isolated git worktrees, runs deterministic quality gates, and merges the results.

Spec (markdown input) → Decompose (DAG of changes) → Dispatch (parallel agents) → Monitor (15s poll cycle) → Verify (quality gates) → Merge (conflict resolution) → Replan (cover gaps)

Who is SET for?

Development teams and technical leaders who already use Claude Code and want to scale beyond single-agent, single-task workflows. If you're an architect managing a backlog of well-specified features, or a CTO evaluating how AI agents can own the full implementation-to-merge cycle, SET is the layer that makes that possible.

It assumes you can write a good spec — it handles everything after that.

Is this production-ready?

SET was built with itself over 79 days. These are the numbers:

  • 1,500+ commits
  • 376 capability specs
  • 720+ agent hours
  • 200+ E2E runs

The MiniShop benchmark delivers 6/6 changes merged, zero human intervention, in 1h 45m. CraftBrew (15 changes, 150+ files, 28 DB models) completed fully autonomously in ~6h.

How Is SET Different From...

...just using Claude Code?

Claude Code in 2026 is dramatically capable: native worktrees, Agent Teams (experimental, 3-5 agents), 26 hook events with 4 handler types, auto-memory, Plan mode, subagents with worktree isolation, Agent SDK, and MCP. SET is built on top of all these primitives.

| | Claude Code (alone) | SET |
|---|---|---|
| Scope | One task or one team session | Full spec → decomposed into N parallel changes |
| Planning | Plan mode (freeform, ephemeral, read-only) | OpenSpec: persistent versioned artifacts with traceability |
| Quality | Hooks can run checks (DIY wiring) | Structured gate pipeline: build → test → E2E → review |
| Merging | Manual git merge, no enforcement | Automated merge queue + conflict resolution + post-merge verification |
| Recovery | Session dies, restart manually | Sentinel detects crash in 30s, graduated escalation |
| Memory | Auto-memory (flat file at startup) | Semantic memory graph with topic recall at 4 lifecycle points |
| State | Session-scoped, lost on restart | Atomic JSON, resumable across restarts |
| Coordination | Agent Teams: one session, shared task list | Cross-session, cross-machine orchestration with sentinel |

Claude Code gives you excellent building blocks. SET gives you the assembled machine. You could build this yourself with the Agent SDK and hooks — SET is the battle-tested implementation.

What SET doesn't have: Claude Code's 101+ plugin marketplace, Agent SDK for custom development, deep IDE integration (VS Code, JetBrains).

...Claude Code Agent Teams?

Agent Teams = parallelism within one session. A lead assigns subtasks to teammates (3-5 recommended); they share context via task list + mailbox. Important: teammates share the working directory by default — two teammates editing the same file leads to overwrites. Worktree isolation available via subagent config but not automatic. Still experimental: no session resumption, task status lag, one team per session, lead is fixed.

SET = parallelism across sessions, machines, and time. A planner decomposes a full spec into a dependency DAG, dispatches each to its own long-running agent, manages quality gates and the merge pipeline. Good for shipping an entire product.

They're complementary. SET can use Agent Teams inside each worktree while managing cross-change orchestration externally.

What Agent Teams does better: Zero-setup parallelism. No framework installation needed — one environment variable starts a team. For a quick parallel task within a single feature, Agent Teams is faster to reach for.

...Cursor's parallel agents?

Cursor 3 (April 2026) has two parallelism modes: local worktree agents (up to 8, via git worktree add) and cloud Background Agents (no cap, credit-bound, each in an isolated AWS Ubuntu VM). You can close your laptop with cloud agents.

What Cursor lacks vs SET:

  • No spec decomposition — agents launched from ad-hoc prompts, no dependency ordering
  • ~30% PR merge rate — by Cursor's own published stat, roughly 30% of generated PRs pass CI and merge without intervention. SET MiniShop: 100%.
  • No inter-agent coordination — multiple agents have no awareness of each other
  • No supervision — no sentinel, no crash recovery, no stall detection

What Cursor does better: Cloud execution (agents work while you sleep), local worktree agents with zero framework setup, polished IDE, multi-model support, CI auto-fix cookbook. Cost caveat: cloud agents ~$5-15 per PR, users report $2000+ in two days with heavy use.

...Devin?

Devin is an autonomous AI engineer — takes a task, works in a sandboxed VM, creates PRs. Can run multiple concurrent sessions (each in its own VM), but sessions are independent with no coordination between them.

| | Devin | SET |
|---|---|---|
| Execution | Cloud VM sandbox | Local worktrees |
| Parallelism | Independent sessions (no coordination) | Coordinated parallel via orchestrator + merge queue |
| Testing | Runs tests if they exist (ad-hoc) | Structured gate pipeline (build → test → E2E → review) |
| Integrations | Excellent Slack/Jira/GitHub | CLI + web dashboard + MCP |
| Merge | Opens PR, relies on CI | Integration gates enforced before merge |

What Devin does better: Slack integration is best-in-class — assign a task from Slack, get a PR back. Cloud VM means zero local setup. The UI for watching agent work is polished. For simple, independent tasks (migrations, CRUD, test writing), Devin is smoother than setting up SET orchestration.

What SET does better: Multi-change coordination, spec traceability, pre-merge quality gates, sentinel supervision, persistent memory, and deterministic merge ordering.

...Kiro (Amazon)?

Kiro (GA Nov 2025) is the closest philosophical match: a spec-driven IDE with formal EARS requirements, design docs, and task lists. Built on VS Code, powered by Bedrock. Supports Claude, DeepSeek, Qwen, MiniMax via auto-router.

Kiro's genuine innovations:

  • EARS spec notation — formal SHALL statements with Requirements-First or Design-First entry points. Specs stay synced with code.
  • Property-Based Testing — extracts testable properties from specs, generates hundreds of random inputs, shrinks to find minimal failing cases. Auto-fixes. Genuinely novel.
  • 10 hook trigger types — File Create/Save/Delete, Prompt Submit, Agent Stop, Pre/Post Tool Use, Pre/Post Task Execution, Manual.
  • Autonomous Agent (preview) — background agent with 3 sub-agents (planner, writer, verifier). Up to 10 concurrent tasks across repos. Opens PRs, never merges. Learns from code review.
  • Multi-model — Claude, DeepSeek, Qwen, auto-router. Pricing: Free→$20→$40→$200/mo.

The differences:

  • Kiro's Autonomous Agent handles 10 tasks but opens PRs without merging. SET manages the full merge pipeline with gates.
  • Kiro has PBT (random test generation from spec properties). SET has deterministic gates (exit codes).
  • Kiro is multi-model. SET is Claude-only.

What Kiro does better: PBT is genuinely novel, Autonomous Agent handles 10 concurrent tasks, multi-model support, lower barrier to entry, 10 hook trigger types. What SET does better: Spec decomposition into DAGs, coordinated merge with gates, sentinel supervision, semantic memory, design integration.

...Augment Intent?

Augment Intent (public beta, Feb 2026, macOS only) is architecturally the most similar tool to SET:

  • Living Specifications — self-maintaining spec docs that auto-update as agents work. Changes propagate to active agents.
  • Coordinator/Specialist/Verifier — 6 specialist personas (Investigate, Implement, Verify, Critique, Debug, Code Review).
  • Git worktree isolation — each task creates a "Space" with its own branch and worktree.
  • Multi-model — runs Claude Code, Codex, OpenCode. Mix models per task.
  • No agent cap — "Run as many agents as the task needs."

What Augment does better: Multi-model mixing (Opus for planning, Sonnet for coding), living specs that auto-update, specialist agent personas, polished desktop UX.

What SET does better: Deterministic gates (exit codes, not agent judgment), proven production track record (200+ runs), Linux support, web dashboard, design integration, persistent semantic memory, full merge pipeline. Augment is macOS-only beta; SET is battle-tested in production.

...Roo Code, Aider, Cline, Windsurf?

All excellent single-agent tools — each with genuine strengths SET lacks:

  • Roo Code — Configurable modes (Architect/Code/Debug/custom), sequential delegation ("Boomerang" pattern). Model-agnostic. Better at: easy custom mode creation, any-LLM support, open source community.
  • Aider — CLI pair programmer. Better at: best-in-class git integration (auto-commit with meaningful messages, full undo), any-model support, repo map (tree-sitter), cost-efficient token usage, edit format innovation.
  • Cline — VS Code extension. Better at: best-in-class MCP marketplace, full transparency (every tool call visible), any-model support, granular approval workflow.
  • Windsurf — AI IDE (acquired by OpenAI ~$3B). Cascade engine had strong within-session context tracking. Current status post-acquisition uncertain.

SET is not a better version of these tools — it's a different category. These are the developer's hands. SET is the sprint board, CI pipeline, and release manager.

...Copilot Coding Agent, OpenHands, Composio?

  • GitHub Copilot Coding Agent — Assign a GitHub Issue, Copilot creates a branch, codes, runs CI, self-reviews, opens a PR. Cloud-hosted. Better at: zero-setup for GitHub users, largest distribution, GitHub-native workflow. Lacks: no multi-agent coordination (each agent independent), no spec decomposition, no pre-merge gate pipeline.
  • OpenHands — Strongest open-source single-agent runtime. Docker-sandboxed, multi-model, strong SWE-bench results (50%+). Provides agent execution, not orchestration workflow. No parallel coordination, no gates, no merge pipeline.
  • Composio — Correction: Composio is a tool-integration platform (250+ API integrations for agents), NOT an agent orchestrator. It provides middleware for CrewAI, LangGraph, etc. to call external tools. Different category from SET.
  • GPT-Engineer / Lovable — App builders for non-developers. Prompt to MVP. Different category entirely.

Capability matrix: SET vs. the landscape

| Tool | Parallel agents | Isolation | Specs | Gates | Merge pipeline | Supervisor | Cloud | Any LLM |
|---|---|---|---|---|---|---|---|---|
| SET | | Worktrees | OpenSpec | 9 gates | | Sentinel | | Claude |
| Augment Intent | Coordinator | Spaces | Living specs | Verifier agent | | | N/A | |
| Claude Code | Experimental | Subagents | | Hooks (DIY) | | | | Claude |
| Cursor | 8 local + cloud | WT + VMs | | | ~30% merge | | | Multi |
| Devin | Independent | Sandbox VM | | Ad-hoc tests | | | | Proprietary |
| Kiro | 10 tasks (preview) | | EARS + PBT | PBT + hooks | Opens PRs | | | Auto-router |
| Copilot Agent | Independent | Cloud VM | | CI + self-review | | | | GPT/Claude |
| Roo Code | Modes | | | | | | | |
| Aider | | | | | | | | |
| Cline | | | | | | | | |
| OpenHands | | Docker | | | | | | |

SET's unique position: the combination of structured specs + deterministic gates + merge pipeline + sentinel. Other tools excel where SET doesn't: cloud execution (Cursor, Devin, Copilot), model flexibility (Aider, Kiro, Cline), living specs (Augment Intent), PBT (Kiro), IDE integration (Kiro, Cursor).

OpenSpec

What is OpenSpec and why not just use a prompt?

OpenSpec is a structured, artifact-driven methodology. Instead of a conversation, work is expressed as a sequence of structured documents that serve as contracts between planner, implementer, and verifier:

  1. Proposal — Why we're doing this (problem, impact)
  2. Specs — What exactly must be built (WHEN/THEN acceptance criteria)
  3. Design — How we'll build it (decisions, tradeoffs)
  4. Tasks — Implementation checklist ([REQ: requirement-name] traceability)

Why not just a prompt?

  • Prompts drift. Agents interpret, improvise, skip. Specs have explicit IN SCOPE / OUT OF SCOPE.
  • Prompts can't be verified. How do you check "build a webshop"? OpenSpec checks every requirement against tasks against code.
  • Prompts don't compose. 5 parallel agents need divided scope. Delta specs assign specific requirements to specific changes.
  • Prompts leave no record. OpenSpec archives the full decision chain for future reference.

How is this different from Claude Code's Plan mode?

Plan mode is a thinking step. OpenSpec is a workflow system.

| | Plan Mode | OpenSpec |
|---|---|---|
| Output | Freeform text | Structured artifacts (proposal, specs, design, tasks) |
| Persistence | Disappears after session | Committed to repo, archived after completion |
| Traceability | None | Every task traces to a requirement |
| Verification | None | Automated: completeness, correctness, coherence |
| Scope | Trust | Explicit IN SCOPE / OUT OF SCOPE |
| Multi-agent | Not designed for it | Delta specs assign scoped work to each agent |

Plan mode helps a single agent think. OpenSpec gives a system of agents structured contracts to work against and verify.

What are delta specs?

When a change is created (e.g., add-user-auth), its spec files are delta specifications — the incremental requirements this change introduces, using ADDED / MODIFIED / REMOVED markers.

After merge, delta specs sync into main specs — the single source of truth. This means:

  • Each change only describes what it changes, not the entire system
  • Multiple changes can touch the same capability without conflicting
  • Main specs evolve incrementally as changes merge
  • Full history preserved in archived changes

What does the artifact workflow look like?

Explore (think, read-only) → Proposal (why) → Specs (what: WHEN/THEN) → Design (how) → Tasks (checklist) → Apply (implement) → Verify (check) → Archive (preserve)

Each artifact depends on the previous. The schema enforces ordering — you can't create tasks before design, because design decisions inform task structure.
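A minimal sketch of that ordering rule. The artifact names come from the workflow above; the function and its shape are illustrative, not SET's actual schema:

```python
ARTIFACT_ORDER = ["proposal", "specs", "design", "tasks"]

def can_create(artifact, existing):
    """An artifact may only be created once all earlier ones exist."""
    idx = ARTIFACT_ORDER.index(artifact)
    return all(a in existing for a in ARTIFACT_ORDER[:idx])

print(can_create("tasks", {"proposal", "specs"}))   # → False (design missing)
print(can_create("design", {"proposal", "specs"}))  # → True
```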

Fast-track: /opsx:ff generates all artifacts in one pass.

Orchestration

How does parallel execution actually work?

  1. Decompose — Planner reads your spec, creates a dependency DAG of independent changes
  2. Dispatch — For each ready change: create worktree, generate context, bootstrap env, start Ralph Loop
  3. Monitor — Every 15 seconds: check progress, detect stalls, track budgets
  4. Verify — Agent reports "done" → run gate pipeline (build → test → E2E → review)
  5. Merge — Sequential merge queue with conflict resolution and post-merge verification
  6. Sync — After each merge, all running worktrees pull main immediately
  7. Replan — After all changes merge, check for uncovered requirements, generate new changes
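The decompose-and-dispatch steps above can be sketched as a ready-set scheduler over the dependency DAG. The change names and the dict shape are illustrative, not SET's API:

```python
def schedule(changes):
    """changes: {name: set of dependency names} -> list of batches.

    Each batch is a set of changes whose dependencies have already
    merged, so its members can run as parallel agents."""
    merged, batches, pending = set(), [], dict(changes)
    while pending:
        ready = {c for c, deps in pending.items() if deps <= merged}
        if not ready:
            raise ValueError("dependency cycle in spec decomposition")
        batches.append(ready)
        merged |= ready
        for c in ready:
            del pending[c]
    return batches

# e.g. auth must merge before profile and orders; cart is independent
print(schedule({
    "add-user-auth": set(),
    "add-cart": set(),
    "add-profile": {"add-user-auth"},
    "add-orders": {"add-user-auth", "add-cart"},
}))
```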

Why git worktrees?

True filesystem isolation without the overhead of cloning:

  • Each agent has its own working directory — no file conflicts during parallel development
  • Each agent has its own branch — clean, independent git history
  • Worktrees share the same .git directory — no disk waste from full clones
  • Independent dep installs, test runs, and builds — no interference
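A sketch of the per-change setup. The branch and path naming convention here is hypothetical; the `git worktree add` command itself is standard git:

```python
def worktree_commands(repo, change):
    """Build the commands that give one change its own isolated worktree."""
    branch = f"change/{change}"
    path = f"{repo}/.worktrees/{change}"
    return [
        # new branch + isolated working directory, shared .git store
        ["git", "-C", repo, "worktree", "add", "-b", branch, path, "main"],
        # runs inside the worktree, invisible to the other agents
        ["git", "-C", path, "status", "--short"],
    ]

for cmd in worktree_commands("/repo/my-app", "add-user-auth"):
    print(" ".join(cmd))
```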

This is fundamentally different from agents "coordinating" via messages in a shared workspace — that approach breaks down when agents edit the same files simultaneously.

What happens when agents conflict?

Multi-layer conflict resolution:

  1. Preventive — Dependency DAG orders cross-cutting changes sequentially. Profile-defined cross-cutting files are serialized.
  2. Generated files — Lockfiles, build artifacts auto-resolved, then regenerated (pnpm install).
  3. Real conflicts — Source code conflicts cause merge-blocked. Sentinel investigates, redispatches, or escalates.
  4. Post-merge sync — All running worktrees pull main immediately after every merge.
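Layer 2 can be sketched as a classifier that separates auto-resolvable generated files from real source conflicts. The lockfile list is illustrative, not SET's full set:

```python
# Filenames treated as regenerable rather than hand-merged.
GENERATED = {"pnpm-lock.yaml", "package-lock.json", "yarn.lock"}

def split_conflicts(paths):
    """Partition conflicted paths: generated files get auto-resolved and
    regenerated (e.g. via pnpm install); the rest block the merge."""
    generated = [p for p in paths if p.rsplit("/", 1)[-1] in GENERATED]
    real = [p for p in paths if p not in generated]
    return generated, real

gen, real = split_conflicts(["pnpm-lock.yaml", "src/cart.ts"])
print(gen, real)  # → ['pnpm-lock.yaml'] ['src/cart.ts']
```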

In practice: CraftBrew (15 changes) had 4 conflicts — all auto-resolved. MiniShop (6 changes): zero conflicts.

What is the sentinel?

An AI supervisor that watches orchestration and handles what goes wrong. Separate agent from the orchestrator — supervisor/subordinate pattern.

| Event | Sentinel action |
|---|---|
| Agent crash | Diagnose from logs, restart or escalate |
| Agent stall (>120s) | Investigate cause, attempt recovery |
| Periodic checkpoint | Auto-approve (routine) or escalate (unusual) |
| Orchestration complete | Generate summary report |
| Budget overrun | Pause agent, escalate |
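The graduated escalation applied to repeated stalls can be sketched as a ladder. The function and thresholds are illustrative; only the warn → restart → redispatch → fail ordering comes from SET:

```python
LADDER = ["warn", "restart", "redispatch", "fail"]

def escalate(stall_count):
    """Each successive stall of the same agent climbs one rung."""
    return LADDER[min(stall_count, len(LADDER)) - 1]

print([escalate(n) for n in (1, 2, 3, 4, 5)])
# → ['warn', 'restart', 'redispatch', 'fail', 'fail']
```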

Cost: typically 5-10 LLM calls per entire run. Saves hours of wasted compute by catching crashes that would otherwise silently waste an overnight run.

Quality & Verification

What are integration gates?

Deterministic quality checks before merging. Exit codes, not LLM judgment.

| Gate | What | How |
|---|---|---|
| build | Types check, code compiles | tsc --noEmit, next build |
| test | Unit/integration tests | vitest run, pytest |
| e2e | Browser tests | playwright test |
| scope_check | Files match scope | Changed files validated against declared scope |
| test_files | Tests present | Test files exist for implemented code |
| review | Code quality, security | Claude review, no CRITICAL findings |
| rules | Custom compliance | Profile-defined rules (naming, patterns) |
| spec_verify | Requirements addressed | All REQ-IDs have tasks |
| smoke | Post-merge sanity | Custom command (runs after merge) |

If a gate fails, the agent receives the error and retries. Self-healing. No human needed.
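A deterministic gate reduces to "run a command, trust only the exit code." A minimal sketch; the runner is illustrative, not SET's implementation:

```python
import subprocess
import sys

def run_gate(name, cmd):
    """Run one gate command; pass/fail is the exit code, nothing else."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    passed = proc.returncode == 0   # deterministic: exit code, not LLM judgment
    return passed, proc.stderr      # stderr is what the agent sees on retry

# Stand-ins for `next build` / `vitest run` so the sketch is self-contained.
ok, _ = run_gate("build", [sys.executable, "-c", "print('compiled')"])
failed, err = run_gate("test", [sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)  # → True False
```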

Why not just trust the LLM's judgment?

Because LLMs hallucinate confidence. "Looks good to me" from a code review is not the same as vitest run returning exit code 0.

MiniShop's 5 gate retries — all self-healed from real bugs an LLM review would have missed:

  1. Missing test file → test gate caught it
  2. Jest config import error → build gate caught it
  3. Playwright auth test failures ×3 → agent fixed to match actual behavior
  4. Post-merge type mismatch → agent synced main
  5. Cart test race condition → agent added waitForSelector

An LLM review would have said "looks good" for at least 3 of these.

How do you measure output quality across runs?

Structural convergence. Run the same spec twice independently and measure similarity:

  • 83/100 MiniShop convergence score
  • 83-87% range across projects
  • 100% schema equivalence
  • 100% convention compliance

Remaining divergence is stylistic (variable naming, CSS order), not structural. The spec + template system produces deterministic architecture even with non-deterministic LLMs.
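As a toy illustration of a convergence score, compare the file manifests of two independent runs; set-compare's real metric is richer than this:

```python
import difflib

def convergence(files_a, files_b):
    """0-100 similarity of two runs' sorted file manifests."""
    sm = difflib.SequenceMatcher(a=sorted(files_a), b=sorted(files_b))
    return round(100 * sm.ratio())

run1 = ["app/cart/page.tsx", "lib/db.ts", "prisma/schema.prisma"]
run2 = run1 + ["lib/utils.ts"]   # second run emitted one extra helper
print(convergence(run1, run1))   # → 100
print(convergence(run1, run2))   # → 86
```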


Memory & Learning

How does persistent memory work?

Hook-driven memory (shodh-memory) captures and injects context automatically. Agents don't need to save explicitly.

| Hook | When | What |
|---|---|---|
| Warmstart | Session start | Loads relevant memories as context |
| Pre-tool | Before each tool call | Injects topic-based recall |
| Post-tool | After Read/Bash | Surfaces past experience |
| Save | Session end | Extracts new insights from conversation |
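Topic-based recall at the pre-tool hook can be sketched as a keyword match over stored notes. The store and the memories here are illustrative; shodh-memory's actual scoring is more involved:

```python
# (topic, note) pairs as a stand-in for the memory store.
MEMORIES = [
    ("prisma", "Run the Prisma generate step after schema edits."),
    ("playwright", "Wait for selectors; cart tests raced without it."),
    ("auth", "Protected routes go through middleware checks."),
]

def recall(tool_input, limit=2):
    """Surface memories whose topic appears in the upcoming tool call."""
    text = tool_input.lower()
    return [note for topic, note in MEMORIES if topic in text][:limit]

print(recall("npx playwright test e2e/cart.spec.ts"))
# → ['Wait for selectors; cart tests raced without it.']
```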

Key finding: zero voluntary saves across 15+ sessions. Agents don't save on their own — the hook infrastructure is essential.

Why does memory matter for orchestration?

Without memory, every agent rediscovers conventions, repeats mistakes, wastes tokens.

+34% convention compliance (CraftBazaar)

Learnings from failed runs convert to rules, enforced in the next run. The system improves with every orchestration.

Architecture & Extensibility

What is the plugin system?

Three layers separate concerns:

  • Layer 1 — Core (lib/set_orch/): Abstract orchestration. Dispatcher, monitor, merger, gates. No project-specific logic.
  • Layer 2 — Modules (modules/): Project-type knowledge. modules/web/ knows Next.js, Playwright, Prisma.
  • Layer 3 — External: Your own plugins via pip install + entry_points. set-project-fintech could add IDOR scanning, PCI compliance.

Each module implements the ProjectType ABC: test detection, forbidden patterns, verification rules, custom gates, merge strategies, planning rules. New project types don't touch core — they extend it.
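A sketch of that split, with illustrative method names; the real interface in lib/set_orch/ is larger:

```python
from abc import ABC, abstractmethod

class ProjectType(ABC):
    """Core (Layer 1) defines the contract; it knows no frameworks."""
    @abstractmethod
    def test_command(self) -> list:
        ...
    @abstractmethod
    def forbidden_patterns(self) -> list:
        ...

class WebProject(ProjectType):
    """A Layer 2 module contributes the framework-specific knowledge."""
    def test_command(self):
        return ["pnpm", "vitest", "run"]
    def forbidden_patterns(self):
        return ["console.log(", ": any"]

print(WebProject().test_command())  # → ['pnpm', 'vitest', 'run']
```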

Can I use this without Claude Code?

No. SET is built specifically for Claude Code: worktrees, hooks, MCP, skills, subagents. This is by design — SET doesn't abstract over LLMs. It leverages Claude's strengths fully: 200K+ context, native tool use, code understanding.

Depth beats breadth. Abstracting to a lowest-common-denominator API would sacrifice these capabilities for theoretical portability.

Can this run on-premise?

The infrastructure is designed for it. SET is self-hosted — no SaaS dependency. The orchestration engine, gates, and state management have no cloud dependency. Only the LLM endpoint needs configuration.

When on-premise Claude models become available for regulated industries (banks, defense, government), SET's architecture works unchanged.

How does design system integration work?

  1. Export from design tool → design-system.md (tokens) + design-brief.md (visual specs)
  2. Dispatcher scope-matches relevant pages to each change → per-change design.md
  3. Agent receives exact hex colors, font names, component layouts
  4. Review gate checks design compliance — token mismatches flagged

Eliminates the "shadcn defaults everywhere" problem. Agents implement your brand, not a generic component library.


What SET Doesn't Do (Yet)

Honest gaps where competitors are ahead

| Gap | Who does it better | Notes |
|---|---|---|
| Cloud execution | Cursor BGA, Devin, Copilot | SET requires a running local machine. Cloud agents work while you sleep. |
| Model flexibility | Aider, Roo Code, Cline | SET is Claude-only. No GPT, Gemini, or local model support. |
| IDE integration | Kiro, Cursor, Windsurf | SET is CLI + web dashboard. No VS Code/JetBrains plugin. |
| Zero-setup | Copilot, Cline, Cursor | SET requires pip install, project init, config. Others are install-and-go. |
| Issue tracker → PR | Copilot Coding Agent | SET works from specs, not from Jira/Linear/GitHub Issues. |
| Slack trigger | Devin | Can't trigger SET from Slack. |
| File-event hooks | Kiro | SET hooks are at orchestration level, not IDE file-save events. |
| MCP marketplace | Cline | SET has a custom MCP server, not a marketplace for third-party tools. |
| Quick prototyping | Lovable, Cursor, Claude Code | SET's spec-driven workflow adds upfront overhead. For a quick prototype, Claude Code alone is faster. |
| Spec writing | | The spec is a bottleneck: orchestration quality is bounded by spec quality. Writing a good spec takes effort. |

These are conscious trade-offs, not oversights. SET optimizes for orchestration depth over integration breadth. The overhead cost is real — SET is not for quick prototypes. It's for when you already know what to build and want deterministic, reproducible implementation.

Practical

What does a spec need to contain?

Your spec is the single most important input. Required:

  1. Project overview — What, who, tech stack
  2. Data model — Entities, fields, relationships
  3. Page layouts — Sections, columns, components
  4. Component behavior — Click, hover, state changes
  5. Auth & roles — Permissions, protected routes
  6. Seed data — Realistic initial data
  7. Design tokens — Brand colors (hex), fonts, spacing
  8. E2E test expectations — Critical flows

Each requirement needs a REQ-ID (REQ-AUTH-01) and at least one WHEN/THEN scenario.
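A toy checker in the spirit of the spec_verify gate, flagging REQ-IDs that lack a WHEN/THEN scenario. The spec fragment and the parser are illustrative:

```python
import re

SPEC = """\
### REQ-AUTH-01
WHEN a visitor submits valid credentials
THEN a session cookie is set and they land on /dashboard

### REQ-AUTH-02
Login failures should be logged.
"""

def missing_scenarios(spec):
    """Return REQ-IDs whose block lacks a WHEN/THEN scenario."""
    blocks = re.split(r"^### ", spec, flags=re.M)[1:]
    return [b.splitlines()[0] for b in blocks
            if "WHEN" not in b or "THEN" not in b]

print(missing_scenarios(SPEC))  # → ['REQ-AUTH-02']
```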

How long does an orchestration take?

| Project | Changes | Wall time | Tokens | Interventions |
|---|---|---|---|---|
| Micro-Web (simple) | 3-4 | ~45m | ~1M | 0 |
| MiniShop (e-commerce) | 6 | 1h 45m | 2.7M | 0 |
| CraftBrew (complex) | 15 | ~6h | ~11M | 0 |

Token scaling is super-linear (4x tokens for 2.5x changes) because later changes require more context from merged code.

What are the self-healing capabilities?

Gate-level:

  • Test failure → agent reads error, fixes, reruns
  • Build error → agent reads type error, fixes it
  • E2E failure → agent sees Playwright trace, updates selectors
  • Type mismatch → agent syncs main, resolves

Sentinel-level:

  • Agent crash → detected in 30s, auto-restart
  • Agent stall → watchdog escalates: warn → restart → redispatch → fail
  • Orphaned worktree → cleaned up on restart

How do I get started?

# Install SET
pip install -e .
pip install -e modules/web

# Initialize a project
set-project init --name my-app --project-type web --template nextjs

# Write your spec (docs/spec.md)

# Start orchestration
curl -X POST http://localhost:7400/api/my-app/sentinel/start \
  -H 'Content-Type: application/json' \
  -d '{"spec":"docs/spec.md"}'

Or step-by-step:

/opsx:explore   → Think through the problem
/opsx:ff change → Generate all artifacts
/opsx:apply     → Implement
/opsx:verify    → Check
/opsx:archive   → Done

The Big Picture

What problem does SET actually solve?

The gap between "AI can write code" and "AI can ship software."

Writing code is 20% of the work. The other 80%: decomposing requirements, coordinating parallel work, handling conflicts, running quality checks, managing merge order, recovering from failures, learning from mistakes.

SET automates the 80%.

Why specs instead of prompts?

"Build a webshop" produces a different webshop every time.

"Build a webshop with these 28 data models, these 12 pages, these design tokens, these auth rules, these seed data records, and these E2E scenarios" produces the same webshop every time.

The spec is the determinism layer. MiniShop: 83/100 structural convergence score across independent runs (measured by set-compare). Without specs, convergence approaches 0%.

How is this different from just running CI/CD?

CI/CD validates code after someone creates a PR. SET manages the entire pipeline before the PR exists:

| CI/CD | SET |
|---|---|
| PR created → tests → review → merge | Spec → decompose → dispatch → gates → merge → replan |

CI/CD assumes someone creates the PR. SET creates the PRs, validates them, merges them, and identifies what's still missing.

Why not abstract over multiple LLMs?

Because depth beats breadth. SET leverages Claude-specific capabilities: 200K+ context, native tool use, worktree support, hooks, MCP. Abstracting to a lowest-common-denominator API would sacrifice these for theoretical portability.

SET bets on Claude getting better — and compounds that bet.

What's the competitive moat?

The combination. No other tool provides all six:

  1. Structured Specs: traceable requirements with WHEN/THEN scenarios, not prompts
  2. Parallel Agents: isolated worktrees, dedicated agents, across machines
  3. Quality Gates: deterministic (exit codes, not vibes). Build, test, E2E, review.
  4. Merge Pipeline: automated conflict resolution, post-merge verification
  5. Sentinel Supervision: crash recovery in 30s, stall detection, budget tracking
  6. Persistent Memory: cross-session learning, convention compliance, continuous improvement

Most tools have 1-2 of these. Closest competitors have 2-3. The value is in the integration — the six capabilities reinforce each other. Structured specs enable meaningful gates. Gates enable autonomous merging. Memory enables learning. The sentinel enables unattended operation.