CodeDocs Vault

AI & LLM Integration

Overview

Comp AI uses a multi-model AI strategy across 12+ distinct use cases, selecting models based on cost, speed, context window, and capability requirements. AI is not a bolt-on feature; it is core to the platform's value proposition.

Model Selection Matrix

| Use Case | Model | Provider | Temp | Why This Model |
|----------|-------|----------|------|----------------|
| Policy chat assistant | claude-sonnet-4-6 | Anthropic | auto | Complex reasoning over long policy documents |
| Policy section editing | claude-sonnet-4-6 | Anthropic | auto | Precise JSON structure generation |
| Cloud security remediation | claude-opus-4-6 | Anthropic | 0 | Deterministic IAM policy/CLI generation |
| PDF content extraction | claude-sonnet-4-6 | Anthropic | auto | Native PDF support (multi-page) |
| Browser automation | claude-sonnet-4-6 | Anthropic | n/a | Stagehand visual navigation agent |
| General assistant chat | gpt-5 | OpenAI | auto | Broad knowledge, tool use |
| Policy generation | gpt-5-mini | OpenAI | auto | Good structure, lower cost than full GPT-5 |
| Questionnaire parsing | gpt-5-mini | OpenAI | auto | Structured Q&A extraction |
| RAG answer generation | gpt-4o-mini | OpenAI | auto | Fast, cheap for short answers |
| Vendor risk assessment | gpt-5.2 | OpenAI | auto | Complex multi-source analysis |
| Auditor content generation | gpt-5.2 | OpenAI | auto | Long-form, factual business writing |
| Vision extraction | gpt-4o | OpenAI | auto | Image understanding |
| SOA answering | gpt-5-mini / gpt-4o-mini | OpenAI | auto | Structured compliance analysis |
| Task relevance matching | llama-4-scout-17b | Groq | auto | Ultra-fast, cheap classification |
| Embeddings | text-embedding-3-small | OpenAI | n/a | Cost-effective for RAG |
| Fast question parsing | meta-llama/gpt-oss-120b | Groq | auto | Ultra-fast first attempt |

Model Selection Philosophy

Cost axis:     Groq (cheapest) → GPT-4o-mini → GPT-5-mini → Claude Sonnet → GPT-5 → Claude Opus
Capability:    Groq (fastest)  → GPT-4o-mini → GPT-5-mini → Claude Sonnet → GPT-5 → Claude Opus
Context:       Groq (32K)      → GPT-4o-mini → GPT-5-mini → Claude (200K) → GPT-5  → Claude Opus

Rule: Use the cheapest model that can reliably do the job.
Exception: Cloud remediation uses Opus at temp 0 because wrong IAM policies = production outage.
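
As a minimal sketch, the tiering rule could be expressed in code like this (the tier names and the task-to-tier mapping are illustrative assumptions, not actual Comp AI source):

// Hypothetical illustration of the "cheapest reliable model" rule;
// tier names and the mapping are assumptions, not Comp AI source.
type ModelTier = 'classification' | 'extraction' | 'reasoning' | 'safety-critical';

const MODEL_FOR_TIER: Record<ModelTier, string> = {
  'classification': 'groq/llama-4-scout-17b',      // cheapest, fastest
  'extraction': 'openai/gpt-4o-mini',              // structured output, low cost
  'reasoning': 'anthropic/claude-sonnet-4-6',      // long context, complex reasoning
  'safety-critical': 'anthropic/claude-opus-4-6',  // deterministic (temperature 0)
};

const selectModel = (tier: ModelTier): string => MODEL_FOR_TIER[tier];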

AI System #1: RAG-Powered Questionnaire Answering

Architecture

               Document Upload
                      │
                ┌─────▼─────┐
                │  Extract  │  mammoth (docx), exceljs (xlsx),
                │  Content  │  unpdf (pdf), Claude vision (images)
                └─────┬─────┘
                      │
                ┌─────▼─────┐
                │   Parse   │  Groq (fast) → Claude (fallback) → OpenAI
                │ Questions │  Extracts Q&A pairs from content
                └─────┬─────┘
                      │
                ┌─────▼─────┐
                │ Generate  │  text-embedding-3-small (OpenAI)
                │Embeddings │  Stored in PostgreSQL via pgvector
                └─────┬─────┘
                      │
              For each question:
                      │
                ┌─────▼─────┐
                │  Vector   │  Similarity search against:
                │  Search   │  - Organization policies
                └─────┬─────┘  - Context documents
                      │        - Manual answers
                ┌─────▼─────┐  - Knowledge base docs
                │    RAG    │
                │  Answer   │  gpt-4o-mini with strict guardrails
                └─────┬─────┘
                      │
                ┌─────▼─────┐
                │   Store   │  questionnaire_question_answer table
                │  Answers  │  status: 'generated'
                └───────────┘

Key Files

| File | Lines | Purpose |
|------|-------|---------|
| apps/api/src/questionnaire/utils/content-extractor.ts | ~1092 | Multi-format file parsing |
| apps/api/src/questionnaire/utils/question-parser.ts | ~200 | AI-powered Q&A extraction |
| apps/api/src/questionnaire/utils/constants.ts | ~100 | System prompts |
| apps/api/src/trigger/questionnaire/answer-question-helpers.ts | ~200 | RAG answer generation |
| apps/api/src/vector-store/lib/core/generate-embedding.ts | ~50 | Embedding generation |

Guardrails

Answer generation prompt (constants.ts):

- Answer based ONLY on the provided context
- If insufficient evidence → "N/A - no evidence found"
- Use "we/our/us" voice for the organization
- Keep answers 1-3 sentences
- Never fabricate information not in context
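
Condensed as code, the guardrails could read like this (an illustrative paraphrase; the verbatim prompt lives in constants.ts):

// Illustrative paraphrase only; see constants.ts for the real prompt.
const ANSWER_SYSTEM_PROMPT = `
You answer security questionnaires for an organization.
- Answer based ONLY on the provided context.
- If the context has insufficient evidence, respond exactly: "N/A - no evidence found".
- Write in the organization's voice ("we/our/us").
- Keep answers to 1-3 sentences.
- Never fabricate information that is not in the context.
`.trim();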

Parsing fallback chain:

1. Groq (meta-llama/gpt-oss-120b) - ultra-fast, 25K char chunks
2. Claude Sonnet - 200K context for large documents
3. OpenAI gpt-4o-mini - final fallback
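
A sketch of how this chain could be wired with the AI SDK (the schema and prompt are illustrative stand-ins for what question-parser.ts actually does):

import { generateObject, type LanguageModel } from 'ai';
import { groq } from '@ai-sdk/groq';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// Illustrative schema; the real one lives in question-parser.ts.
const qaSchema = z.object({
  pairs: z.array(z.object({ question: z.string(), answer: z.string().optional() })),
});

async function parseWith(model: LanguageModel, content: string) {
  const { object } = await generateObject({
    model,
    schema: qaSchema,
    prompt: `Extract every question/answer pair from this document:\n\n${content}`,
  });
  return object.pairs;
}

export async function parseQuestions(content: string) {
  try {
    return await parseWith(groq('meta-llama/gpt-oss-120b'), content); // ultra-fast first attempt
  } catch {
    try {
      return await parseWith(anthropic('claude-sonnet-4-6'), content); // 200K context
    } catch {
      return await parseWith(openai('gpt-4o-mini'), content); // final fallback
    }
  }
}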

Vector Store Implementation

// apps/api/src/vector-store/lib/core/generate-embedding.ts
// Uses OpenAI text-embedding-3-small (1536 dimensions)
// Stored in PostgreSQL via pgvector extension
 
// Sources indexed:
// - Policies (full content, chunked)
// - Context Q&A (manual answers from onboarding/settings)
// - Knowledge base documents (uploaded files)
// - Manual answer entries (human overrides for questionnaires)
 
// Search: cosine similarity with top-k results
// Batch support: batchGenerateEmbeddings() for efficiency
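
The search side could look like this sketch, assuming a pgvector column and a tagged-template SQL client (the table and column names are illustrative, not the actual schema):

import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';
import postgres from 'postgres';

const sql = postgres(process.env.DATABASE_URL!);

// Illustrative only: "embeddings(content, embedding)" is an assumed schema.
export async function searchSimilar(query: string, topK = 5) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'), // 1536 dimensions
    value: query,
  });
  const vector = JSON.stringify(embedding);
  // pgvector's <=> operator is cosine distance; smaller means more similar.
  return sql`
    SELECT content, 1 - (embedding <=> ${vector}::vector) AS similarity
    FROM embeddings
    ORDER BY embedding <=> ${vector}::vector
    LIMIT ${topK}
  `;
}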

AI System #2: Policy Generation & Editing

Policy Generation (Trigger.dev task)

Inputs:
  - Policy template (FrameworkEditorPolicyTemplate)
  - Organization data (industry, size, tech stack)
  - Active frameworks (SOC 2, ISO 27001, etc.)
  - Context hub answers (onboarding data)

Process:
  1. Build comprehensive prompt with company info
  2. Call gpt-5-mini for TipTap JSON structure
  3. Sanitize output (remove "<<TO REVIEW>>" placeholders)
  4. Align with template structure
  5. Save to database with version tracking

Key file: apps/api/src/trigger/policies/update-policy-helpers.ts
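
A compressed sketch of steps 2-3 (the Zod schema here is a loose stand-in; the real TipTap schema and prompt assembly live in update-policy-helpers.ts):

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// Loose stand-in for TipTap's document shape; the real schema is richer.
const tiptapNode: z.ZodType<any> = z.lazy(() =>
  z.object({
    type: z.string(),
    attrs: z.record(z.any()).optional(),
    text: z.string().optional(),
    content: z.array(tiptapNode).optional(),
  }),
);

export async function generatePolicyDraft(prompt: string) {
  const { object } = await generateObject({
    model: openai('gpt-5-mini'),
    schema: z.object({ type: z.literal('doc'), content: z.array(tiptapNode) }),
    prompt, // step 1: company info, frameworks, context hub answers
  });
  // Step 3: sanitize by stripping "<<TO REVIEW>>" placeholders.
  return JSON.parse(JSON.stringify(object).replaceAll('<<TO REVIEW>>', ''));
}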

Policy Chat Assistant (streaming)

Endpoint: POST /api/policies/[policyId]/chat
Model: claude-sonnet-4-6
Max steps: 5 (prevents runaway tool loops)

Tools available:
  - getVendors: Fetch organization's vendor list
  - getPolicies: Fetch other policies for cross-reference
  - getEvidence: Fetch related evidence
  - proposePolicy: Submit edited TipTap JSON

System prompt emphasizes:
  - "PRESERVE UNCHANGED TEXT EXACTLY"
  - Section boundary rules for headings/lists
  - TipTap JSON structure requirements
  - Prohibition on copying previous proposals

Key file: apps/app/src/app/api/policies/[policyId]/chat/route.ts
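
In condensed form, the route could look like this sketch (tool bodies and helper names are assumptions; only the model, step limit, and tool names come from the description above):

import { convertToModelMessages, streamText, tool, stepCountIs } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

declare const POLICY_CHAT_SYSTEM_PROMPT: string; // "PRESERVE UNCHANGED TEXT EXACTLY", etc.
declare function fetchVendors(): Promise<unknown>;
declare function savePolicyProposal(content: unknown): Promise<unknown>;

export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = streamText({
    model: anthropic('claude-sonnet-4-6'),
    system: POLICY_CHAT_SYSTEM_PROMPT,
    messages: convertToModelMessages(messages),
    stopWhen: stepCountIs(5), // prevents runaway tool loops
    tools: {
      getVendors: tool({
        description: "Fetch the organization's vendor list",
        inputSchema: z.object({}),
        execute: async () => fetchVendors(),
      }),
      proposePolicy: tool({
        description: 'Submit edited TipTap JSON',
        inputSchema: z.object({ content: z.any() }),
        execute: async ({ content }) => savePolicyProposal(content),
      }),
      // getPolicies / getEvidence elided for brevity
    },
  });
  return result.toUIMessageStreamResponse();
}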

Section Editor (single-turn)

Endpoint: POST /api/policies/[policyId]/edit-section
Model: claude-sonnet-4-6
Purpose: Edit a single section without full policy context
Strips previous proposePolicy tool calls from history to prevent reuse

AI System #3: Cloud Security Remediation

Architecture

Security Finding (e.g., "S3 bucket public access enabled")
    │
    ▼
Phase 1: Generate Initial Fix Plan
    │ Model: claude-opus-4-6 (temperature: 0)
    │ Input: Finding description + cloud provider
    │ Output: { readSteps, fixSteps, rollbackSteps }
    │
    ▼
Phase 2: Execute Read Steps
    │ AWS SDK v3 command execution
    │ Gathers actual resource state
    │
    ▼
Phase 3: Refine Plan with Real Data
    │ Model: claude-opus-4-6 (temperature: 0)
    │ Input: Finding + actual AWS state
    │ Output: Refined { readSteps, fixSteps, rollbackSteps }
    │
    ▼
Execute Fix Steps (with acknowledgment)
    │ Maps step commands to AWS SDK calls
    │ Tracks: executing → success/failed/needs_permissions
    │
    ▼
Rollback Available (if fix fails)

Key Files

| File | Purpose |
|------|---------|
| apps/api/src/cloud-security/ai-remediation.service.ts | Orchestrates 2-phase fix planning |
| apps/api/src/cloud-security/ai-remediation.prompt.ts | AWS fix plan Zod schema + prompts |
| apps/api/src/cloud-security/gcp-ai-remediation.prompt.ts | GCP REST API fix schemas |
| apps/api/src/cloud-security/azure-ai-remediation.prompt.ts | Azure ARM API fix schemas |
| apps/api/src/cloud-security/aws-command-executor.ts | Maps AI output to AWS SDK calls |

Why Temperature 0?

// ai-remediation.service.ts
const result = await generateObject({
  model: anthropic('claude-opus-4-6'),
  temperature: 0,  // CRITICAL: deterministic output
  // ...
});

Cloud remediation generates executable commands. A creative variation in an IAM policy is exactly the "wrong IAM policies = production outage" failure mode called out in the model selection philosophy above.

Temperature 0 ensures reproducible, exact outputs.

Multi-Cloud Schema Design

Each cloud provider has its own Zod schema for fix plans:

AWS: Uses SDK v3 command class names

// FixStep for AWS
{
  service: "S3",
  command: "PutPublicAccessBlock",
  params: { Bucket: "my-bucket", ... }
}

GCP: Uses REST API endpoints

// FixStep for GCP
{
  method: "PATCH",
  url: "https://storage.googleapis.com/storage/v1/b/my-bucket",
  body: { ... }
}

Azure: Uses ARM REST API

// FixStep for Azure
{
  method: "PUT",
  url: "https://management.azure.com/subscriptions/.../providers/...",
  body: { ... }
}
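
A sketch of what the AWS variant's Zod schema could look like (the readSteps/fixSteps/rollbackSteps shape comes from the flow above; anything beyond the FixStep fields shown is an assumption about ai-remediation.prompt.ts):

import { z } from 'zod';

// Illustrative; the production schema carries more per-step metadata.
const awsFixStep = z.object({
  service: z.string(),        // e.g. "S3"
  command: z.string(),        // SDK v3 command class name, e.g. "PutPublicAccessBlock"
  params: z.record(z.any()),  // command input object
});

export const awsFixPlanSchema = z.object({
  readSteps: z.array(awsFixStep),     // gather actual resource state
  fixSteps: z.array(awsFixStep),      // apply the remediation
  rollbackSteps: z.array(awsFixStep), // undo path if the fix fails
});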

AI System #4: Vendor Risk Assessment

Pipeline

Vendor created/updated
    │
    ▼
Trigger.dev: vendor-risk-assessment-task
    │
    ├── Firecrawl: Scrape vendor website (core pages)
    ├── Firecrawl: Research vendor news/incidents
    │
    ▼
LLM Analysis (gpt-5.2)
    │ Input: Website content + news + existing vendor data
    │ Output: Structured risk assessment
    │   - Security posture analysis
    │   - Risk scores (low/medium/high)
    │   - Compliance certification detection
    │   - Task generation for remediation
    │
    ▼
PostgreSQL advisory lock (prevents concurrent assessment)
    │
    ▼
Save: VendorRiskAssessment with version tracking
Create: TaskItems for follow-up actions

Deduplication

// PostgreSQL advisory locks prevent concurrent vendor assessment
// Keyed by website domain hash
// Versions: v1, v2, v3... for re-runs
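
The lock pattern itself could look like this sketch (the Postgres built-ins are real; the helper shape and client are assumed):

import postgres from 'postgres';

const sql = postgres(process.env.DATABASE_URL!);

// Sketch only; the function shape is an assumption. hashtext() maps the
// vendor domain to a stable 32-bit advisory-lock key.
export async function assessVendorOnce(websiteDomain: string) {
  await sql.begin(async (tx) => {
    // Transaction-scoped lock: auto-released at commit/rollback, and the
    // transaction pins one connection so lock and release share a session.
    const [{ locked }] = await tx`
      SELECT pg_try_advisory_xact_lock(hashtext(${websiteDomain})) AS locked
    `;
    if (!locked) return; // another worker is already assessing this vendor
    // ... scrape with Firecrawl, analyze with gpt-5.2, save assessment vN ...
  });
}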

AI System #5: Assistant Chat (API)

Architecture

Endpoint: POST /v1/assistant-chat/completions
Model: gpt-5 (OpenAI)
Streaming: Server-sent events via streamText()
Steps limit: 5 (prevents runaway)

Tools (permission-gated per user):
  - findOrganization: Always available
  - getUser: Always available
  - getPolicies: Requires policy:read
  - getPolicyContent: Requires policy:read
  - getRisks: Requires risk:read
  - getRiskById: Requires risk:read

History: Ephemeral, stored in Upstash Redis

Permission-Gated Tool Pattern

// apps/api/src/assistant-chat/assistant-chat-tools.ts
// apps/api/src/assistant-chat/assistant-chat-tools.ts (condensed)
function buildTools(permissions: UserPermissions) {
  // Record type so permission-gated tools can be added conditionally.
  const tools: Record<string, Tool> = {
    findOrganization: { ... },  // Always available
    getUser: { ... },           // Always available
  };

  if (hasPermission(permissions, 'policy', 'read')) {
    tools.getPolicies = { ... };
    tools.getPolicyContent = { ... };
  }

  if (hasPermission(permissions, 'risk', 'read')) {
    tools.getRisks = { ... };
    tools.getRiskById = { ... };
  }

  return tools;
}

This ensures the LLM cannot access data the user doesn't have permission to see, even through tool calls.


AI System #6: Browser Automation

Stack

Browserbase (cloud browser infrastructure)
    └── Stagehand v3 (AI browser agent)
        └── Claude Sonnet 4.6 (visual understanding)
            └── Playwright (browser protocol)

How It Works

// apps/api/src/browserbase/browserbase.service.ts
 
// 1. Create/reuse persistent browser context per org
const contextId = await getOrCreateOrgContext(orgId);
 
// 2. Create session with context
const session = await browserbase.sessions.create({
  projectId: BROWSERBASE_PROJECT_ID,
  browserSettings: { context: { id: contextId } }
});
 
// 3. Initialize Stagehand with Claude
const stagehand = new Stagehand({
  browserbaseSessionID: session.id,
  modelName: 'anthropic/claude-sonnet-4-6',
  modelClientOptions: { apiKey: ANTHROPIC_API_KEY }
});
 
// 4. Execute natural-language tasks (max 20 steps)
await stagehand.agent.execute(taskInstructions);
 
// 5. Capture screenshots → upload to S3 → return presigned URLs

Use Cases


AI System #7: Auditor Content Generation

Model: gpt-5.2
Trigger: Trigger.dev task

Generates sections:
  - Company background
  - Services provided
  - Mission & vision
  - System description
  - Critical vendors (filtered for SOC 2 relevance)
  - Subservice organizations

Data sources:
  - Organization context hub answers
  - Website scraping (if URL available)

Guardrails:
  - "NEVER mention missing information"
  - "Write about what IS available"
  - "No hedging words (may, might, likely)"
  - "No attribution phrases"

AI System #8: Task Automation Chat

Frontend Architecture

React component: chat.tsx
Framework: @ai-sdk/react useChat() hook
Transport: DefaultChatTransport → /api/tasks-automations/chat

Features:
  - Streaming with visible reasoning steps
  - Dynamic model selection via AI Gateway
  - Ephemeral → persistent automation transition
  - Tools: web search (Exa), website crawling (Firecrawl)
  - Secret injection and info context provision

Model Gateway

// apps/app/src/.../tools/gateway.ts
// AI Gateway allows runtime model selection
// User can choose model + reasoning effort
// Reasoning effort: minimal | low | medium
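
A sketch of what gateway-based selection could look like, assuming Vercel's @ai-sdk/gateway provider (the model IDs and effort plumbing are illustrative):

import { streamText } from 'ai';
import { gateway } from '@ai-sdk/gateway';

// Sketch: modelId is user-chosen at runtime, e.g. 'openai/gpt-5'
// or 'anthropic/claude-sonnet-4-6'.
export function chatWithUserModel(
  modelId: string,
  prompt: string,
  effort: 'minimal' | 'low' | 'medium',
) {
  return streamText({
    model: gateway(modelId),
    prompt,
    providerOptions: {
      openai: { reasoningEffort: effort }, // applies to OpenAI reasoning models
    },
  });
}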

Guardrails & Safety Patterns

1. Step Limiting

// Prevents runaway tool-calling loops
streamText({
  stopWhen: stepCountIs(5),  // AI SDK v5
  // (older AI SDK versions: maxSteps: 5)
  // ...
});

Used in: policy chat, assistant chat, section editor

2. Temperature Control

// Deterministic outputs for safety-critical operations
temperature: 0  // cloud remediation (IAM policies, CLI commands)

// Creative tasks (policy writing, chat) leave temperature at the
// provider default ("auto" in the matrix above).

3. Zod Schema Validation

// All structured LLM outputs validated before use
const result = await generateObject({
  schema: fixPlanSchema,  // Zod schema
  // ...
});
// Invalid outputs throw NoObjectGeneratedError → fallback handling

4. Context Grounding (RAG)

"Answer based ONLY on the provided context"
"If insufficient → respond 'N/A - no evidence found'"

Prevents hallucination in questionnaire answers and SOA responses.

5. Content Truncation

// Groq: 25K char chunks (32K context limit)
// General parsing: 80K char chunks
// Vision models: document slicing at 80K chars
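
A minimal sketch of the size rule (character limits from above; the naive fixed-size split is an assumption, and real chunkers usually respect sentence or paragraph boundaries):

// Naive fixed-size chunking for illustration; limits match the figures above.
const GROQ_CHUNK_CHARS = 25_000;    // stays inside Groq's 32K context limit
const GENERAL_CHUNK_CHARS = 80_000; // general parsing / vision document slicing

function chunk(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}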

6. Permission-Gated Tools

// Assistant chat tools filtered by user permissions
// LLM can only call tools the user has access to

7. Fallback Chains

Groq (fast/cheap) → Claude (large context) → OpenAI (reliable)

Resilient parsing even when primary provider is down.

8. Error Handling

// NoObjectGeneratedError: Special handling with JSON.parse() fallback
// Missing API keys: Returns 503 (not 500)
// Browserbase failures: Actionable error messages
// Vendor assessment: 2-attempt retry with advisory locks
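
The NoObjectGeneratedError path could look like this sketch (the AI SDK error class and its text property are real; the lenient-parse recovery mirrors the note above):

import { generateObject, NoObjectGeneratedError } from 'ai';

// Sketch: salvage the raw model output when strict object generation fails.
async function generateObjectLenient(options: any /* loosely typed for the sketch */) {
  try {
    const { object } = await generateObject(options);
    return object;
  } catch (error) {
    if (NoObjectGeneratedError.isInstance(error)) {
      return JSON.parse(error.text ?? '{}'); // JSON.parse() fallback on raw text
    }
    throw error; // unrelated failure, e.g. missing API key, surfaced as 503
  }
}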

Cost Optimization Patterns

1. Batch Operations

// Instead of N embedding calls, batch them
batchGenerateEmbeddings(texts);  // Uses embedMany()
generateAnswerWithRAGBatch();    // Pre-fetch vectors + parallel LLM
batchSearchSOAQuestions();       // Pre-fetch all control vectors

2. Model Tiering

Simple classification → Groq Llama (fractions of a cent)
Structured extraction → GPT-4o-mini (cents)
Complex reasoning     → Claude Sonnet (dimes)
Safety-critical       → Claude Opus (dollars)

3. Streaming Responses

Policy chat and the assistant use streamText() so tokens stream to the client as they are generated: the user sees output immediately instead of waiting for the full completion.

4. Chunking Strategy

Small docs: Single LLM call
Large docs (>25K): Chunk and process in parallel
Huge docs (>80K): Use Claude 200K context as fallback
Images/PDFs: Vision models for extraction

Prompt Engineering Patterns

Role Establishment

"You are an expert in GRC (Governance, Risk, and Compliance)"
"You are a helpful assistant in Comp AI"

Structured Output Instructions

"Return a JSON object with the following structure..."
"Use TipTap JSON format for policy content"
"Generate AWS SDK v3 command names, not CLI commands"

Negative Instructions (What NOT to Do)

"NEVER mention missing information"
"Do NOT use general knowledge"
"NEVER hallucinate data not in context"
"Do NOT copy previous policy proposals"

Voice & Tone Control

"Use 'we/our/us' voice for the organization"
"No hedging words (may, might, likely)"
"Keep answers 1-3 sentences"

Context Injection

"Current date: {date}"
"Organization: {name} in {industry}"
"Active frameworks: {frameworks}"
"Company size: {size} employees"

What's Notable

Strengths

  1. Model diversity - Not locked to one provider; uses the right model for each task
  2. RAG is core, not afterthought - Vector store deeply integrated with compliance workflow
  3. Permission-aware AI - Tools respect RBAC, preventing data leakage through AI
  4. Deterministic where it matters - Temperature 0 for cloud remediation prevents dangerous creative outputs
  5. Fallback chains - Graceful degradation across LLM providers
  6. Audit trail includes AI - Generated answers tracked separately from manual ones

Potential Improvements

  1. No token counting or budget enforcement - No visible per-org cost tracking
  2. No content filtering layer - Relies on model-level safety, no explicit input/output filtering
  3. No prompt injection defense - User-uploaded documents feed directly into prompts
  4. Embedding model is basic - text-embedding-3-small may miss nuance; no reranking step
  5. No A/B testing of models - Model selection is hardcoded, not experimentally validated
  6. No caching of LLM responses - Repeated identical queries hit the API each time