Swisscheese: Differentiators for a Multi-Agent AI Code Review Platform
Context: Swisscheese is a platform for scaling up agents (Claude Code, Codex, etc.) to review AI-generated code. Named after the Swiss cheese model from safety engineering -- stack imperfect layers so the holes don't align.
The Core Problem
AI-generated code volume is exploding. Human reviewers are the bottleneck. But AI reviewing AI has a fundamental risk: correlated failures -- the same blind spots that caused the bug may also cause the reviewer to miss it.
Swisscheese needs to make multi-agent review more trustworthy than any single reviewer, human or AI.
Six High-Impact Differentiators
1. Adversarial Multi-Model Review (the actual Swiss cheese)
This is the namesake and should be the core moat.
The insight: Claude reviewing Claude's code has correlated blind spots. GPT reviewing GPT's code has the same problem. But Claude reviewing GPT's code (or vice versa) has uncorrelated failure modes -- different training data, different reasoning patterns, different biases.
What to build:
- Assign different agent providers to different review dimensions (security, correctness, performance, style)
- Force cross-model review: if code was generated by Claude, review with Codex and Gemini
- Each reviewer gets a different persona/prompt: "You are a security auditor", "You are a performance engineer", "You are the maintainer who'll debug this at 3am"
Multica's Backend interface pattern (server/pkg/agent/agent.go:15-21) is directly reusable: it already abstracts 10 providers behind one interface, and Swisscheese can do the same behind a single Review(diff, context) -> []Finding contract.
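A minimal Go sketch of what that contract could look like in Swisscheese; the type and method names below are illustrative assumptions, not Multica's actual API (only the one-interface-per-provider pattern is borrowed):

```go
package review

import "context"

// Diff, ReviewContext, and Finding are stubs here; their fuller shapes are
// sketched under differentiators #3 and #4 below.
type (
	Diff          struct{ Patch string }
	ReviewContext struct{ Intent string }
	Finding       struct{ Location string }
)

// Reviewer adapts one provider (Claude, Codex, Gemini, ...) behind a single
// contract, mirroring Multica's Backend interface pattern.
type Reviewer interface {
	Name() string
	Review(ctx context.Context, diff Diff, rc ReviewContext) ([]Finding, error)
}

// Assignment routes review dimensions to providers; the orchestrator can also
// enforce the cross-model rule (never review code with the model that wrote it).
type Assignment map[string]Reviewer // e.g. "security" -> gemini, "performance" -> codex
```

Keeping the contract this narrow is what makes cross-model assignment cheap: adding a provider is one adapter, not a new review pipeline.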
2. Disagreement-as-Signal
Most review tools show N independent reviews. That's just noise multiplication. The real value is in disagreement detection.
What to build:
- When 3 reviewers agree a function is fine but 1 flags a race condition, that's a high-value signal
- Disagreement score per code region: reviewers_flagging / total_reviewers
- Auto-escalate to human review when disagreement exceeds a threshold (see the sketch after this list)
- Dashboard: "These 5 hunks had full agreement (auto-approve). These 2 had disagreement (needs human eyes)."
Why this is a differentiator: No existing tool does this. GitHub Copilot review gives one opinion. Running 3 agents gives 3 opinions. Swisscheese would give a synthesized confidence assessment.
The pitch: "Swisscheese doesn't give you more reviews. It tells you where the holes are."
3. Review-Specific Context Window
Multica gives agents full codebase context. Review needs a different, more structured context:
┌─ Review Context ─────────────────────────────────────┐
│ │
│ 1. The Diff (what changed) │
│ 2. The Intent (PR description, linked issue, spec) │
│ 3. The Blast Radius (what depends on changed files) │
│ 4. The History (recent changes to same files, │
│ past review comments, known fragile areas) │
│ 5. The Rules (coding standards, security policies, │
│ team-specific patterns) │
│ 6. The Test Coverage (what's tested, what isn't) │
│ │
└───────────────────────────────────────────────────────┘
Most agents just see the diff. Swisscheese should assemble this full context automatically -- that's what makes agent reviews approach human reviewer quality.
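One plausible shape for that assembled context as a Go struct; every field name here is an assumption mirroring the six boxes above, not an existing Swisscheese or Multica type:

```go
// ReviewContext expands the stub from the Reviewer sketch above.
type ReviewContext struct {
	Diff         string             // 1. the unified diff
	Intent       string             // 2. PR description, linked issue, spec
	BlastRadius  []string           // 3. files/packages that depend on the changed files
	History      []string           // 4. recent changes, past review comments, known fragile areas
	Rules        []string           // 5. coding standards, security policies, team patterns
	TestCoverage map[string]float64 // 6. changed file -> covered-line fraction
}
```

Each source is independent (VCS host, dependency graph, coverage service), so a failed lookup should degrade the context rather than block the review.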
4. Structured Finding Taxonomy (Not Free-Text Comments)
Agents produce noisy, verbose review comments. Force structure:
Finding:
severity: critical | warning | nit | question
category: security | correctness | performance | style | test-coverage
location: file:line_start-line_end
claim: "This SQL query is vulnerable to injection"
evidence: "User input flows from line 42 to line 67 without sanitization"
suggestion: "<concrete code fix>"
confidence: 0.0-1.0
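In Go terms, the taxonomy might look like the sketch below (expanding the Finding stub from the Reviewer sketch), plus the dedup rule the next list relies on; the Reviewer field and the confidence-boost formula are illustrative assumptions:

```go
// Finding mirrors the taxonomy above; real code would likely use typed
// constants for Severity and Category instead of raw strings.
type Finding struct {
	Severity   string  // critical | warning | nit | question
	Category   string  // security | correctness | performance | style | test-coverage
	Location   string  // file:line_start-line_end
	Claim      string
	Evidence   string
	Suggestion string
	Confidence float64 // 0.0-1.0
	Reviewer   string  // which agent produced it (assumed field, for attribution)
}

// Dedup collapses findings that different agents raised at the same location
// and category into one finding whose confidence reflects the agreement.
func Dedup(findings []Finding) []Finding {
	best := map[string]Finding{}
	count := map[string]int{}
	for _, f := range findings {
		key := f.Category + "@" + f.Location
		count[key]++
		if prev, ok := best[key]; !ok || f.Confidence > prev.Confidence {
			best[key] = f
		}
	}
	out := make([]Finding, 0, len(best))
	for key, f := range best {
		// Each additional agreeing reviewer halves the remaining doubt.
		for i := 1; i < count[key]; i++ {
			f.Confidence += (1 - f.Confidence) / 2
		}
		out = append(out, f)
	}
	return out
}
```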
Why this matters:
- Structured findings are dedupable (3 agents flagging the same line = 1 finding with higher confidence)
- They're filterable (show me only security criticals, hide nits)
- They're measurable (track precision/recall over time)
- They feed the disagreement engine (compare findings by location)
5. Feedback Loop: Review Quality Scoring (long-term moat)
This is where long-term compounding value is built.
Track per finding:
- Was it accepted (developer made the change)?
- Was it dismissed (developer ignored it)?
- Was it a false positive (developer explicitly marked as wrong)?
- Did the flagged code later cause a bug in production?
Use this to:
- Tune which agent+prompt combinations are most accurate per category
- Auto-suppress review patterns with high false-positive rates
- Surface "this agent catches 90% of security issues but only 40% of performance issues" -- then route accordingly
- Report to users: "Swisscheese caught 47 bugs this month that would have shipped. 12 were critical."
No existing tool closes this loop. It's expensive to build but creates compounding value.
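A rough sketch of what closing the loop could look like; the outcome labels map one-to-one to the bullets above, and all names are illustrative:

```go
// Outcome records what eventually happened to a posted finding.
type Outcome int

const (
	Accepted       Outcome = iota // developer made the change
	Dismissed                     // developer ignored it
	FalsePositive                 // developer explicitly marked it wrong
	LaterCausedBug                // the flagged code later caused a production bug
)

// ResolvedFinding joins a finding with its outcome and provenance.
type ResolvedFinding struct {
	Agent    string
	Category string
	Outcome  Outcome
}

// AcceptanceByAgentCategory backs routing decisions like "this agent is
// strong on security, weak on performance".
func AcceptanceByAgentCategory(history []ResolvedFinding) map[string]float64 {
	accepted, total := map[string]int{}, map[string]int{}
	for _, r := range history {
		key := r.Agent + "/" + r.Category
		total[key]++
		if r.Outcome == Accepted {
			accepted[key]++
		}
	}
	rates := make(map[string]float64, len(total))
	for key, n := range total {
		rates[key] = float64(accepted[key]) / float64(n)
	}
	return rates
}
```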
6. Incremental Re-Review
When a developer pushes a fix in response to a review comment, most tools re-review the entire PR, wasting tokens and time.
What to build:
- Track which findings map to which code regions
- On new push, only re-review regions that changed
- Auto-resolve findings where the suggested fix was applied
- Flag findings where the region changed but the concern wasn't addressed
Operationally critical at scale -- if you're reviewing hundreds of PRs/day, re-reviewing everything is prohibitively expensive.
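A sketch of the region-overlap check this requires, reusing the Finding type from the taxonomy sketch; the location parsing here is deliberately naive:

```go
import (
	"fmt"
	"strings"
)

// Hunk is one contiguous line range touched by the new push.
type Hunk struct {
	File       string
	Start, End int
}

// parseLocation splits "file.go:10-20"; a real implementation would handle
// single-line locations and malformed input.
func parseLocation(loc string) (file string, start, end int) {
	i := strings.LastIndex(loc, ":")
	file = loc[:i]
	fmt.Sscanf(loc[i+1:], "%d-%d", &start, &end)
	return
}

// RegionChanged reports whether a finding's region overlaps any hunk in the
// new push.
func RegionChanged(f Finding, newHunks []Hunk) bool {
	file, start, end := parseLocation(f.Location)
	for _, h := range newHunks {
		if h.File == file && h.Start <= end && start <= h.End {
			return true
		}
	}
	return false
}

// Triage splits open findings on a new push: findings in untouched regions are
// carried forward as-is; findings in changed regions go back for re-review
// (and are auto-resolved if the suggested fix was applied).
func Triage(open []Finding, newHunks []Hunk) (carry, reReview []Finding) {
	for _, f := range open {
		if RegionChanged(f, newHunks) {
			reReview = append(reReview, f)
		} else {
			carry = append(carry, f)
		}
	}
	return carry, reReview
}
```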
What to Borrow from Multica
| Multica Pattern | Reuse? | Adaptation for Swisscheese |
|---|---|---|
| Backend interface for 10 providers (agent.go:15-21) | Yes | Reviewer interface with Review(diff, context) -> []Finding |
| Token usage tracking (per-model, per-task) | Yes | Critical for cost management at review scale |
| Agent-as-subprocess (claude.go:22-212) | Yes | Same approach -- spawn agent CLIs, parse structured output |
| WS + cache invalidation for real-time (realtime/, use-realtime-sync.ts) | Yes | Stream review progress live |
| Prompt injection via CLAUDE.md (runtime_config.go:41-242) | Adapt | Inject review-specific rules and context instead of task context |
| PII/secret redaction (redact/redact.go) | Yes | Code diffs may contain secrets too |
What Multica Gets Wrong for Review (Fix in Swisscheese)
| Multica Gap | Risk for Review | Swisscheese Fix |
|---|---|---|
| No quality assessment of agent output | Review output quality IS the product | Structured findings + confidence scores + feedback loop |
| No spending limits (tracked but not capped) | Runaway costs at review scale | Per-PR and per-org token budgets |
| Single opinion per task | Misses correlated failures | Multi-opinion + synthesis + disagreement scoring |
| Conversational loop prevention | Not applicable | Reviews are structured, not conversational -- different problem |
| Synchronous event bus | Review is batch-oriented | Async queue may fit better for fan-out to N reviewers |
| Agent instructions as free text | Hard to measure review quality | Structured review prompts with explicit taxonomy |
Architecture Sketch
GitHub/GitLab Webhook (PR opened/updated)
│
▼
┌─── Swisscheese Server ───┐
│ │
│ Context Assembler │
│ (diff + intent + │
│ blast radius + │
│ history + rules) │
│ │ │
│ ▼ │
│ Review Orchestrator │
│ ┌──────┼──────┐ │
│ ▼ ▼ ▼ │
│ Agent1 Agent2 Agent3 │ ← different models/prompts
│ (sec) (logic) (perf) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Finding Aggregator │
│ (dedup, disagree, │
│ confidence scoring) │
│ │ │
│ ▼ │
│ Human Escalation Gate │
│ (auto-approve if │
│ all agree + high │
│ confidence) │
│ │ │
└─────────┼────────────────┘
▼
Post structured review
comments back to PR
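A standalone sketch of the orchestrator's fan-out step, assuming the Reviewer contract from differentiator #1 (stubs repeated so the snippet stands alone); error handling and the downstream aggregation are only indicated in comments:

```go
package review

import (
	"context"
	"log"
	"sync"
)

// Stubs; fuller shapes are sketched in the earlier sections.
type (
	Diff          struct{ Patch string }
	ReviewContext struct{}
	Finding       struct{ Location string }
)

type Reviewer interface {
	Name() string
	Review(ctx context.Context, diff Diff, rc ReviewContext) ([]Finding, error)
}

// RunReview fans the assembled context out to every assigned reviewer
// concurrently, collects their findings, and hands the combined set to the
// aggregator (dedup, disagreement scoring, escalation gate).
func RunReview(ctx context.Context, reviewers []Reviewer, diff Diff, rc ReviewContext) []Finding {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		all []Finding
	)
	for _, r := range reviewers {
		wg.Add(1)
		go func(r Reviewer) {
			defer wg.Done()
			findings, err := r.Review(ctx, diff, rc)
			if err != nil {
				// One failed layer shouldn't block the others; the gap is
				// itself a signal the escalation gate can use.
				log.Printf("reviewer %s failed: %v", r.Name(), err)
				return
			}
			mu.Lock()
			all = append(all, findings...)
			mu.Unlock()
		}(r)
	}
	wg.Wait()
	return all
}
```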
Prioritization
| Priority | Differentiator | Why |
|---|---|---|
| MVP | #2 Disagreement-as-signal | Unique capability, immediately valuable, validates the core thesis |
| MVP | #1 Cross-model review | Enables #2, addresses correlated failures directly |
| MVP | #4 Structured findings | Required for #2 to work (can't compare free-text) |
| V2 | #3 Review context assembly | Improves quality of each individual review |
| V2 | #6 Incremental re-review | Cost optimization, important at scale |
| Long-term | #5 Feedback loop | Compounding moat, needs data volume to be useful |