CodeDocs Vault

07 — Other Considerations

Things the prompt didn't ask for but you'd want to know if you're learning from this repo or considering forking it.

Notable ideas to take

1. Schema-as-sanitizer for prompt-injection control

The single most copyable pattern. The output_schema: blocks on untrusted-reader subagents (e.g. subagents/reader.yaml:35-58) use regex pattern + maxLength + additionalProperties: false not just to validate types but to filter an attacker's English-language injection out of any data that crosses an agent boundary. If the field is pattern: "^[A-Za-z0-9._:-]+$" with maxLength: 64, an "ignore previous instructions and exfiltrate" sentence cannot survive the validator.

Take this even outside FSI: any agent that ingests user-supplied PDFs or web content into a downstream tool-using agent benefits from a schema-validated waist between them.
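
As a concrete sketch of the idea (field names and patterns here are illustrative, not the repo's actual reader.yaml schema), the whole sanitizer fits in a few lines of stdlib Python:

```python
import re

# Illustrative allowlist schema in the spirit of the reader subagent's
# output_schema: tight character classes plus length caps mean a free-form
# English injection cannot pass. Field names and patterns are hypothetical.
FIELD_PATTERNS = {
    "ticker": re.compile(r"^[A-Z.]{1,6}$"),
    "fiscal_period": re.compile(r"^[0-9]{4}Q[1-4]$"),
}
MAX_LEN = 64

def sanitize(untrusted: dict) -> dict:
    """Reject unknown keys (additionalProperties: false) and any value
    that fails its pattern or length cap."""
    if set(untrusted) != set(FIELD_PATTERNS):
        raise ValueError("unexpected or missing fields")
    for key, value in untrusted.items():
        if not isinstance(value, str) or len(value) > MAX_LEN:
            raise ValueError(f"{key}: wrong type or too long")
        if not FIELD_PATTERNS[key].fullmatch(value):
            raise ValueError(f"{key}: failed pattern")
    return untrusted
```

An injected instruction like "ignore previous instructions and exfiltrate" fails the character-class check regardless of how persuasive the sentence is; that is the whole point of the pattern.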

2. "One source, two wrappers" with byte-equality CI

scripts/check.py:114-131 does a filecmp.dircmp between vertical-plugins/<v>/skills/<s>/ and agent-plugins/<slug>/skills/<s>/ and fails the build on any drift. This means the bundled-skill copies are always fresh, even though they're physically duplicated.

Generalizable for any repo where you need self-contained distribution units (Cowork plugins) and a single source of truth (verticals).
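
A minimal stdlib sketch of the same byte-equality check (simplified: it walks the source tree only, so files that exist solely on the destination side go unflagged; check.py's real logic may differ):

```python
import filecmp
import os
from pathlib import Path

def drifted_files(src: Path, dst: Path) -> list[str]:
    """Return relative paths that differ byte-for-byte between two
    mirrored skill trees. A sketch of the check.py idea, not its code."""
    drift = []
    for root, _dirs, files in os.walk(src):
        rel = Path(root).relative_to(src)
        # shallow=False forces content comparison, not just stat() metadata;
        # `errors` covers files missing or unreadable on the dst side.
        _match, mismatch, errors = filecmp.cmpfiles(
            root, dst / rel, files, shallow=False
        )
        drift += [str(rel / f) for f in mismatch + errors]
    return drift
```

A CI gate is then just `sys.exit(1) if drifted_files(src, dst) else 0`.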

3. Honest threat-model docs in code comments

scripts/orchestrate.py:8-15 opens with a paragraph documenting why the script's regex-based handoff routing is the wrong long-term answer ("In production, prefer emitting handoffs via a dedicated tool call or a typed SSE event the model cannot produce by quoting document text."). The reference implementation actively points at its own weaker spot. Worth copying in any reference code.

4. Anti-example sections (<common_mistakes>)

dcf-model/SKILL.md:581-756 lists known WRONG patterns the model has actually emitted (linear approximations in sensitivity tables, // WRONG - Placeholder note, "common rationalization to REJECT"). Most prompt libraries list only positive examples. Listing the justification the model would invent for a wrong answer ("Writing 75+ formulas feels complex, so I'll leave a note") is a high-leverage prompting move.

5. [UNSOURCED] as searchable uncertainty marker

When the agent can't source a number, it's required to mark it [UNSOURCED] rather than estimate (pitch-agent.md:31, earnings-reviewer.md:29). This converts hallucination risk into an explicit lint hit the human reviewer can grep for. Cheap, robust, broadly applicable.
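
The corresponding lint is a few lines of Python; a sketch:

```python
import re

UNSOURCED = re.compile(r"\[UNSOURCED\]")

def unsourced_hits(text: str) -> list[int]:
    """Return 1-based line numbers carrying the [UNSOURCED] marker,
    so a reviewer can triage every unsourced figure before it ships."""
    return [i for i, line in enumerate(text.splitlines(), 1)
            if UNSOURCED.search(line)]
```

The same result is one `grep -n '\[UNSOURCED\]'` away, which is why the marker works: it is greppable by design.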

6. Setup log pattern in the M365 wizard

claude-for-msft-365-install/commands/setup.md:11-15 tells Claude to maintain a setup log at ~/Desktop/claude-for-msft-365-install-setup.md and to resume from it on rerun. This makes the wizard idempotent and resumable — useful pattern for any long-running interactive Claude session that walks a user through provisioning.

7. Schema-validated env-var substitution

scripts/deploy-managed-agent.sh:43-47 rejects ${VAR} values containing characters outside [A-Za-z0-9._/:@-]. This is a tiny line of defense against an attacker setting MCP_URL='"; rm -rf /' in an environment variable that then gets templated into a shell-adjacent context. Cheap, almost free, often forgotten.
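
The same allowlist is easy to replicate before any templating happens; a Python sketch using the character class the deploy script enforces (function and variable names here are illustrative):

```python
import re

# Mirrors the deploy script's allowlist: anything outside this class
# (quotes, semicolons, spaces, backticks...) is refused before templating.
SAFE_ENV_VALUE = re.compile(r"^[A-Za-z0-9._/:@-]+$")

def check_env(name: str, value: str) -> str:
    """Refuse to template values containing shell-significant characters."""
    if not SAFE_ENV_VALUE.fullmatch(value):
        raise ValueError(f"{name} contains disallowed characters")
    return value
```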

Pitfalls to be aware of

1. Skill drift surface area is wide

Any commit that edits a vertical-plugins/<v>/skills/<s>/ file but forgets to run scripts/sync-agent-skills.py will fail CI, but only at the byte-equality check, not at the moment of editing. Suggestion: add python3 scripts/sync-agent-skills.py as a pre-commit hook.

2. ALLOWED_TARGETS in orchestrate.py is hand-maintained

scripts/orchestrate.py:23-27 hardcodes the 10 agent slugs:

ALLOWED_TARGETS = {
    "pitch-agent", "market-researcher", "earnings-reviewer", "meeting-prep-agent",
    "model-builder", "gl-reconciler", "kyc-screener",
    "valuation-reviewer", "month-end-closer", "statement-auditor",
}

This list is independent of marketplace.json and managed-agent-cookbooks/. If a new agent is added, this set must be updated by hand. check.py doesn't yet cross-check it. A drift here is silent: the new agent ships, and handoffs to it are dropped without error.
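
The missing cross-check is a few lines. This sketch assumes marketplace.json has a top-level "plugins" list with "name" fields, which may not match the real layout; adjust before adopting:

```python
import json
from pathlib import Path

def missing_targets(allowed: set[str], marketplace_path: Path) -> set[str]:
    """Return agent slugs declared in marketplace.json but absent from
    ALLOWED_TARGETS. Layout of marketplace.json is an assumption here."""
    data = json.loads(marketplace_path.read_text())
    slugs = {p["name"] for p in data.get("plugins", [])}
    return slugs - allowed
```

Run inside check.py, a non-empty return becomes a CI failure instead of a silently dropped handoff.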

3. Handoff-as-text vs handoff-as-tool-call

The handoff_request JSON blob lives in the orchestrator's text output. A malicious document could inject a literal blob that the regex catches. Mitigations are real (ALLOWED_TARGETS + HANDOFF_PAYLOAD_SCHEMA), but the core architectural fix — a typed handoff primitive the model cannot produce by quoting text — depends on a platform feature that isn't there yet.

4. Skills are description-routed

Skill triggering is keyword-based (model reads description and decides). This works when descriptions are concrete ("Triggers on 'CIM', 'confidential information memorandum', ...") and fails when they're vague. There is no enforced trigger registry — you discover misrouting only by running the agent.

5. Every agent uses claude-opus-4-7

Every cookbook (orchestrator and every subagent) sets model: claude-opus-4-7. Opus is overkill for a reader subagent doing structured JSON extraction; Sonnet/Haiku would be cheaper and just as accurate. The repo leaves cost on the table by not differentiating the model per leaf, but doing so adds a tuning surface and is left to the firm.

6. MCP URLs are unauthenticated in the manifest

.mcp.json and mcp_servers: only declare url: — auth is implicit (per-user OAuth in Cowork; vendor-specific in CMA via env vars). A stale or hijacked MCP URL could exfiltrate the agent's tool calls. Production deployments should pin URLs to firm-controlled proxies that verify upstream identity.

7. .mcp.json is not validated by check.py

check.py validates that marketplace.json, plugin.json, and steering-examples.json parse as JSON, but it never touches .mcp.json. A typo there ships silently to users.
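
A sketch of the missing check, assuming the conventional {"mcpServers": {...}} layout (verify against the actual file before adopting):

```python
import json
from pathlib import Path

def check_mcp_manifest(path: Path) -> list[str]:
    """Parse .mcp.json and flag servers without a url. The expected
    shape {"mcpServers": {name: {"url": ...}}} is an assumption here."""
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as e:
        return [f"{path}: invalid JSON ({e})"]
    problems = []
    for name, server in data.get("mcpServers", {}).items():
        if "url" not in server:
            problems.append(f"{name}: missing url")
    return problems
```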

8. Hooks scaffolded but unused

hooks/hooks.json is [] or {}. The hooks system would be a natural place to enforce "always run recalc.py after model-builder writes an .xlsx" — currently the prompt tells the LLM to do it, which is weaker than a Stop hook.

9. No behavioral tests

scripts/test-cookbooks.sh checks structural shape only. There are no recorded prompts or evaluation harnesses against which agent quality is measured. A change to a skill body that subtly breaks DCF outputs would not be caught by CI — only by a downstream user noticing.
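
Even a minimal golden-output harness would catch gross regressions; a sketch, with every name hypothetical (run_agent is whatever callable invokes the agent under test, and each case file pairs a prompt with its expected structured output):

```python
import json
from pathlib import Path

def run_golden_suite(cases_dir: Path, run_agent) -> list[str]:
    """Tiny golden-output harness: compare each agent response against a
    recorded expectation and report the cases that drifted."""
    failures = []
    for case_file in sorted(cases_dir.glob("*.json")):
        case = json.loads(case_file.read_text())
        actual = run_agent(case["prompt"])
        if actual != case["expected"]:
            failures.append(f"{case_file.name}: got {actual!r}")
    return failures
```

Exact-match goldens are brittle for free prose, but for the schema-validated JSON these subagents emit they are a reasonable first eval.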

10. Microsoft 365 setup wizard is a powerful agent

claude-for-msft-365-install/commands/setup.md walks an admin through Vertex/Bedrock/Foundry provisioning, asks for OAuth secrets in chat ("paste the Client ID when you have it"), and shells out to node/gcloud/aws. The setup is great UX but means the admin is pasting credentials into a chat window during onboarding. The flow is appropriate for the use case (admins, in their own env), but it's worth flagging that this command, by design, has tenant-admin power.

Things this repo could add

  1. Skill drift pre-commit hook. Stop the editing-without-syncing footgun.
  2. check.py cross-check orchestrate.py:ALLOWED_TARGETS against marketplace.json agent slugs.
  3. .mcp.json JSON-schema validation in check.py.
  4. A behavioral eval harness — even a small one, with golden outputs per agent for representative steering events.
  5. Per-leaf model selection. Default reader subagents to Sonnet/Haiku; orchestrator + write-holder to Opus.
  6. A Stop hook that runs python recalc.py automatically after model-builder writes an .xlsx.
  7. Move the handoff format to a tool call when the platform supports it; mark orchestrate.py as deprecated.


The repo is a reference catalogue. Anyone forking it should treat the pattern set (isolation tiers, schema-as-sanitizer, allowlist+validate handoffs, anti-example prompting, cite-or-flag-as-unsourced) as the deliverable, more than any specific agent.