Swisscheese: Differentiators for a Multi-Agent AI Code Review Platform
Context: Swisscheese is a platform for scaling up agents (Claude Code, Codex, etc.) to review AI-generated code. Named after the Swiss cheese model from safety engineering -- stack imperfect layers so the holes don't align.
The Core Problem
AI-generated code volume is exploding. Human reviewers are the bottleneck. But AI reviewing AI has a fundamental risk: correlated failures -- the same blind spots that caused the bug may also cause the reviewer to miss it.
Swisscheese needs to make multi-agent review more trustworthy than any single reviewer, human or AI.
Six High-Impact Differentiators
1. Adversarial Multi-Model Review (the actual Swiss cheese)
This is the namesake and should be the core moat.
The insight: Claude reviewing Claude's code has correlated blind spots. GPT reviewing GPT's code has the same problem. But Claude reviewing GPT's code (or vice versa) has uncorrelated failure modes -- different training data, different reasoning patterns, different biases.
What to build:
- Assign different agent providers to different review dimensions (security, correctness, performance, style)
- Force cross-model review: if code was generated by Claude, review with Codex and Gemini
- Each reviewer gets a different persona/prompt: "You are a security auditor", "You are a performance engineer", "You are the maintainer who'll debug this at 3am"
Multica's Backend interface pattern (server/pkg/agent/agent.go:15-21) is directly reusable: it already abstracts 10 providers behind one interface, and Swisscheese can do the same behind a single Review(diff, context) -> []Finding contract.
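A minimal Go sketch of what that contract could look like in Swisscheese; the type and method names below are illustrative assumptions, not Multica's actual API (only the one-interface-per-provider pattern is borrowed):

```go
package review

import "context"

// Diff, ReviewContext, and Finding are stubs here; their fuller shapes are
// sketched under differentiators #3 and #4 below.
type (
	Diff          struct{ Patch string }
	ReviewContext struct{ Intent string }
	Finding       struct{ Location string }
)

// Reviewer adapts one provider (Claude, Codex, Gemini, ...) behind a single
// contract, mirroring Multica's Backend interface pattern.
type Reviewer interface {
	Name() string
	Review(ctx context.Context, diff Diff, rc ReviewContext) ([]Finding, error)
}

// Assignment routes review dimensions to providers; the orchestrator can also
// enforce the cross-model rule (never review code with the model that wrote it).
type Assignment map[string]Reviewer // e.g. "security" -> gemini, "performance" -> codex
```

Keeping the contract this narrow is what makes cross-model assignment cheap: adding a provider is one adapter, not a new review pipeline.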
2. Disagreement-as-Signal
Most review tools show N independent reviews. That's just noise multiplication. The real value is in disagreement detection.
What to build:
- When 3 reviewers agree a function is fine but 1 flags a race condition, that's a high-value signal
- Disagreement score per code region: reviewers_flagging / total_reviewers
- Auto-escalate to human review when disagreement exceeds a threshold (see the sketch after this list)
- Dashboard: "These 5 hunks had full agreement (auto-approve). These 2 had disagreement (needs human eyes)."
Why this is a differentiator: No existing tool does this. GitHub Copilot review gives one opinion. Running 3 agents gives 3 opinions. Swisscheese would give a synthesized confidence assessment.
The pitch: "Swisscheese doesn't give you more reviews. It tells you where the holes are."
3. Review-Specific Context Window
Multica gives agents full codebase context. Review needs a different, more structured context:
┌─ Review Context ─────────────────────────────────────┐
│ │
│ 1. The Diff (what changed) │
│ 2. The Intent (PR description, linked issue, spec) │
│ 3. The Blast Radius (what depends on changed files) │
│ 4. The History (recent changes to same files, │
│ past review comments, known fragile areas) │
│ 5. The Rules (coding standards, security policies, │
│ team-specific patterns) │
│ 6. The Test Coverage (what's tested, what isn't) │
│ │
└───────────────────────────────────────────────────────┘
Most agents just see the diff. Swisscheese should assemble this full context automatically -- that's what makes agent reviews approach human reviewer quality.
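One plausible shape for that assembled context as a Go struct; every field name here is an assumption mirroring the six boxes above, not an existing Swisscheese or Multica type:

```go
// ReviewContext expands the stub from the Reviewer sketch above.
type ReviewContext struct {
	Diff         string             // 1. the unified diff
	Intent       string             // 2. PR description, linked issue, spec
	BlastRadius  []string           // 3. files/packages that depend on the changed files
	History      []string           // 4. recent changes, past review comments, known fragile areas
	Rules        []string           // 5. coding standards, security policies, team patterns
	TestCoverage map[string]float64 // 6. changed file -> covered-line fraction
}
```

Each source is independent (VCS host, dependency graph, coverage service), so a failed lookup should degrade the context rather than block the review.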
4. Structured Finding Taxonomy (Not Free-Text Comments)
Agents produce noisy, verbose review comments. Force structure:
Finding:
severity: critical | warning | nit | question
category: security | correctness | performance | style | test-coverage
location: file:line_start-line_end
claim: "This SQL query is vulnerable to injection"
evidence: "User input flows from line 42 to line 67 without sanitization"
suggestion: "<concrete code fix>"
confidence: 0.0-1.0
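In Go terms, the taxonomy might look like the sketch below (expanding the Finding stub from the Reviewer sketch), plus the dedup rule the next list relies on; the Reviewer field and the confidence-boost formula are illustrative assumptions:

```go
// Finding mirrors the taxonomy above; real code would likely use typed
// constants for Severity and Category instead of raw strings.
type Finding struct {
	Severity   string  // critical | warning | nit | question
	Category   string  // security | correctness | performance | style | test-coverage
	Location   string  // file:line_start-line_end
	Claim      string
	Evidence   string
	Suggestion string
	Confidence float64 // 0.0-1.0
	Reviewer   string  // which agent produced it (assumed field, for attribution)
}

// Dedup collapses findings that different agents raised at the same location
// and category into one finding whose confidence reflects the agreement.
func Dedup(findings []Finding) []Finding {
	best := map[string]Finding{}
	count := map[string]int{}
	for _, f := range findings {
		key := f.Category + "@" + f.Location
		count[key]++
		if prev, ok := best[key]; !ok || f.Confidence > prev.Confidence {
			best[key] = f
		}
	}
	out := make([]Finding, 0, len(best))
	for key, f := range best {
		// Each additional agreeing reviewer halves the remaining doubt.
		for i := 1; i < count[key]; i++ {
			f.Confidence += (1 - f.Confidence) / 2
		}
		out = append(out, f)
	}
	return out
}
```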
Why this matters:
- Structured findings are dedupable (3 agents flagging the same line = 1 finding with higher confidence)
- They're filterable (show me only security criticals, hide nits)
- They're measurable (track precision/recall over time)
- They feed the disagreement engine (compare findings by location)
5. Feedback Loop: Review Quality Scoring (long-term moat)
This is where long-term compounding value is built.
Track per finding:
- Was it accepted (developer made the change)?
- Was it dismissed (developer ignored it)?
- Was it a false positive (developer explicitly marked as wrong)?
- Did the flagged code later cause a bug in production?
Use this to:
- Tune which agent+prompt combinations are most accurate per category
- Auto-suppress review patterns with high false-positive rates
- Surface "this agent catches 90% of security issues but only 40% of performance issues" -- then route accordingly
- Report to users: "Swisscheese caught 47 bugs this month that would have shipped. 12 were critical."
No existing tool closes this loop. It's expensive to build but creates compounding value.
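A rough sketch of what closing the loop could look like; the outcome labels map one-to-one to the bullets above, and all names are illustrative:

```go
// Outcome records what eventually happened to a posted finding.
type Outcome int

const (
	Accepted       Outcome = iota // developer made the change
	Dismissed                     // developer ignored it
	FalsePositive                 // developer explicitly marked it wrong
	LaterCausedBug                // the flagged code later caused a production bug
)

// ResolvedFinding joins a finding with its outcome and provenance.
type ResolvedFinding struct {
	Agent    string
	Category string
	Outcome  Outcome
}

// AcceptanceByAgentCategory backs routing decisions like "this agent is
// strong on security, weak on performance".
func AcceptanceByAgentCategory(history []ResolvedFinding) map[string]float64 {
	accepted, total := map[string]int{}, map[string]int{}
	for _, r := range history {
		key := r.Agent + "/" + r.Category
		total[key]++
		if r.Outcome == Accepted {
			accepted[key]++
		}
	}
	rates := make(map[string]float64, len(total))
	for key, n := range total {
		rates[key] = float64(accepted[key]) / float64(n)
	}
	return rates
}
```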
6. Incremental Re-Review
When a developer pushes a fix in response to a review comment, most tools re-review the entire PR, wasting tokens and time.
What to build:
- Track which findings map to which code regions
- On new push, only re-review regions that changed
- Auto-resolve findings where the suggested fix was applied
- Flag findings where the region changed but the concern wasn't addressed
Operationally critical at scale -- if you're reviewing hundreds of PRs/day, re-reviewing everything is prohibitively expensive.
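A sketch of the region-overlap check this requires, reusing the Finding type from the taxonomy sketch; the location parsing here is deliberately naive:

```go
import (
	"fmt"
	"strings"
)

// Hunk is one contiguous line range touched by the new push.
type Hunk struct {
	File       string
	Start, End int
}

// parseLocation splits "file.go:10-20"; a real implementation would handle
// single-line locations and malformed input.
func parseLocation(loc string) (file string, start, end int) {
	i := strings.LastIndex(loc, ":")
	file = loc[:i]
	fmt.Sscanf(loc[i+1:], "%d-%d", &start, &end)
	return
}

// RegionChanged reports whether a finding's region overlaps any hunk in the
// new push.
func RegionChanged(f Finding, newHunks []Hunk) bool {
	file, start, end := parseLocation(f.Location)
	for _, h := range newHunks {
		if h.File == file && h.Start <= end && start <= h.End {
			return true
		}
	}
	return false
}

// Triage splits open findings on a new push: findings in untouched regions are
// carried forward as-is; findings in changed regions go back for re-review
// (and are auto-resolved if the suggested fix was applied).
func Triage(open []Finding, newHunks []Hunk) (carry, reReview []Finding) {
	for _, f := range open {
		if RegionChanged(f, newHunks) {
			reReview = append(reReview, f)
		} else {
			carry = append(carry, f)
		}
	}
	return carry, reReview
}
```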
What to Borrow from Multica
| Multica Pattern | Reuse? | Adaptation for Swisscheese |
|---|---|---|
| Backend interface for 10 providers (agent.go:15-21) | Yes | Reviewer interface with Review(diff, context) -> []Finding |
| Token usage tracking (per-model, per-task) | Yes | Critical for cost management at review scale |
| Agent-as-subprocess (claude.go:22-212) | Yes | Same approach -- spawn agent CLIs, parse structured output |
| WS + cache invalidation for real-time (realtime/, use-realtime-sync.ts) | Yes | Stream review progress live |
| Prompt injection via CLAUDE.md (runtime_config.go:41-242) | Adapt | Inject review-specific rules and context instead of task context |
| PII/secret redaction (redact/redact.go) | Yes | Code diffs may contain secrets too |
What Multica Gets Wrong for Review (Fix in Swisscheese)
| Multica Gap | Risk for Review | Swisscheese Fix |
|---|---|---|
| No quality assessment of agent output | Review output quality IS the product | Structured findings + confidence scores + feedback loop |
| No spending limits (tracked but not capped) | Runaway costs at review scale | Per-PR and per-org token budgets |
| Single opinion per task | Misses correlated failures | Multi-opinion + synthesis + disagreement scoring |
| Conversational loop prevention | Not applicable | Reviews are structured, not conversational -- different problem |
| Synchronous event bus | Review is batch-oriented | Async queue may fit better for fan-out to N reviewers |
| Agent instructions as free text | Hard to measure review quality | Structured review prompts with explicit taxonomy |
Architecture Sketch
GitHub/GitLab Webhook (PR opened/updated)
│
▼
┌─── Swisscheese Server ───┐
│ │
│ Context Assembler │
│ (diff + intent + │
│ blast radius + │
│ history + rules) │
│ │ │
│ ▼ │
│ Review Orchestrator │
│ ┌──────┼──────┐ │
│ ▼ ▼ ▼ │
│ Agent1 Agent2 Agent3 │ ← different models/prompts
│ (sec) (logic) (perf) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Finding Aggregator │
│ (dedup, disagree, │
│ confidence scoring) │
│ │ │
│ ▼ │
│ Human Escalation Gate │
│ (auto-approve if │
│ all agree + high │
│ confidence) │
│ │ │
└─────────┼────────────────┘
▼
Post structured review
comments back to PR
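A standalone sketch of the orchestrator's fan-out step, assuming the Reviewer contract from differentiator #1 (stubs repeated so the snippet stands alone); error handling and the downstream aggregation are only indicated in comments:

```go
package review

import (
	"context"
	"log"
	"sync"
)

// Stubs; fuller shapes are sketched in the earlier sections.
type (
	Diff          struct{ Patch string }
	ReviewContext struct{}
	Finding       struct{ Location string }
)

type Reviewer interface {
	Name() string
	Review(ctx context.Context, diff Diff, rc ReviewContext) ([]Finding, error)
}

// RunReview fans the assembled context out to every assigned reviewer
// concurrently, collects their findings, and hands the combined set to the
// aggregator (dedup, disagreement scoring, escalation gate).
func RunReview(ctx context.Context, reviewers []Reviewer, diff Diff, rc ReviewContext) []Finding {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		all []Finding
	)
	for _, r := range reviewers {
		wg.Add(1)
		go func(r Reviewer) {
			defer wg.Done()
			findings, err := r.Review(ctx, diff, rc)
			if err != nil {
				// One failed layer shouldn't block the others; the gap is
				// itself a signal the escalation gate can use.
				log.Printf("reviewer %s failed: %v", r.Name(), err)
				return
			}
			mu.Lock()
			all = append(all, findings...)
			mu.Unlock()
		}(r)
	}
	wg.Wait()
	return all
}
```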
Prioritization
| Priority | Differentiator | Why |
|---|---|---|
| MVP | #2 Disagreement-as-signal | Unique capability, immediately valuable, validates the core thesis |
| MVP | #1 Cross-model review | Enables #2, addresses correlated failures directly |
| MVP | #4 Structured findings | Required for #2 to work (can't compare free-text) |
| V2 | #3 Review context assembly | Improves quality of each individual review |
| V2 | #6 Incremental re-review | Cost optimization, important at scale |
| Long-term | #5 Feedback loop | Compounding moat, needs data volume to be useful |