AI & LLM Integration
Overview
Comp AI uses a multi-model AI strategy across 12+ distinct use cases, selecting models based on cost, speed, context window, and capability requirements. AI is not a bolt-on feature - it's core to the platform's value proposition.
Model Selection Matrix
| Use Case | Model | Provider | Temp | Why This Model |
|---|---|---|---|---|
| Policy chat assistant | claude-sonnet-4-6 | Anthropic | auto | Complex reasoning over long policy documents |
| Policy section editing | claude-sonnet-4-6 | Anthropic | auto | Precise JSON structure generation |
| Cloud security remediation | claude-opus-4-6 | Anthropic | 0 | Deterministic IAM policy/CLI generation |
| PDF content extraction | claude-sonnet-4-6 | Anthropic | auto | Native PDF support (multi-page) |
| Browser automation | claude-sonnet-4-6 | Anthropic | — | Stagehand visual navigation agent |
| General assistant chat | gpt-5 | OpenAI | auto | Broad knowledge, tool use |
| Policy generation | gpt-5-mini | OpenAI | auto | Good structure, lower cost than full GPT-5 |
| Questionnaire parsing | gpt-5-mini | OpenAI | auto | Structured Q&A extraction |
| RAG answer generation | gpt-4o-mini | OpenAI | auto | Fast, cheap for short answers |
| Vendor risk assessment | gpt-5.2 | OpenAI | auto | Complex multi-source analysis |
| Auditor content generation | gpt-5.2 | OpenAI | auto | Long-form, factual business writing |
| Vision extraction | gpt-4o | OpenAI | auto | Image understanding |
| SOA answering | gpt-5-mini / gpt-4o-mini | OpenAI | auto | Structured compliance analysis |
| Task relevance matching | llama-4-scout-17b | Groq | auto | Ultra-fast, cheap classification |
| Embeddings | text-embedding-3-small | OpenAI | — | Cost-effective for RAG |
| Fast question parsing | meta-llama/gpt-oss-120b | Groq | auto | Ultra-fast first attempt |
Model Selection Philosophy
Cost axis: Groq (cheapest) → GPT-4o-mini → GPT-5-mini → Claude Sonnet → GPT-5 → Claude Opus
Capability axis: Groq (least capable, fastest) → GPT-4o-mini → GPT-5-mini → Claude Sonnet → GPT-5 → Claude Opus (most capable)
Context axis: Groq (32K) → GPT-4o-mini → GPT-5-mini → Claude (200K) → GPT-5 → Claude Opus
Rule: Use the cheapest model that can reliably do the job.
Exception: Cloud remediation uses Opus at temp 0 because wrong IAM policies = production outage.
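The tiering rule can be made concrete as a small routing helper. This is an illustrative sketch, not code from the repo: the `TaskClass` categories and `pickModel` function are invented here, and only the model IDs come from the matrix above.

```typescript
// Hypothetical routing helper for the "cheapest reliable model" rule.
// Task categories and the mapping are illustrative, not the real code.
type TaskClass = 'classification' | 'extraction' | 'reasoning' | 'safety-critical';

function pickModel(task: TaskClass): string {
  switch (task) {
    case 'classification':  return 'llama-4-scout-17b';  // Groq: ultra-fast, cheapest
    case 'extraction':      return 'gpt-5-mini';         // structured output, low cost
    case 'reasoning':       return 'claude-sonnet-4-6';  // long-context reasoning
    case 'safety-critical': return 'claude-opus-4-6';    // most capable, run at temp 0
  }
}
```

In practice this decision is baked into each call site rather than centralized, but the ordering is the same: escalate tiers only when the cheaper model fails the job.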
AI System #1: RAG-Powered Questionnaire Answering
Architecture
```
Document Upload
      │
 ┌────▼─────┐
 │ Extract  │  mammoth (docx), exceljs (xlsx),
 │ Content  │  unpdf (pdf), Claude vision (images)
 └────┬─────┘
      │
 ┌────▼─────┐
 │  Parse   │  Groq (fast) → Claude (fallback) → OpenAI
 │Questions │  Extracts Q&A pairs from content
 └────┬─────┘
      │
 ┌────▼─────┐
 │ Generate │  text-embedding-3-small (OpenAI)
 │Embeddings│  Stored in PostgreSQL via pgvector
 └────┬─────┘
      │
 For each question:
      │
 ┌────▼─────┐
 │  Vector  │  Similarity search against:
 │  Search  │  - Organization policies
 └────┬─────┘  - Context documents
      │        - Manual answers
 ┌────▼─────┐  - Knowledge base docs
 │   RAG    │
 │  Answer  │  gpt-4o-mini with strict guardrails
 └────┬─────┘
      │
 ┌────▼─────┐
 │  Store   │  questionnaire_question_answer table
 │ Answers  │  status: 'generated'
 └──────────┘
```
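The vector-search step above reduces to cosine similarity with top-k selection. A minimal in-memory sketch follows; in production this runs inside pgvector, and the `Chunk` shape and function names here are illustrative only.

```typescript
// Minimal sketch of similarity search: cosine distance over in-memory
// embeddings. The real system delegates this to pgvector in PostgreSQL.
interface Chunk { source: string; text: string; embedding: number[]; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

The top-k hits (drawn from policies, context documents, manual answers, and knowledge-base docs) become the only context the RAG model is allowed to answer from.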
Key Files
| File | Lines | Purpose |
|---|---|---|
| apps/api/src/questionnaire/utils/content-extractor.ts | ~1092 | Multi-format file parsing |
| apps/api/src/questionnaire/utils/question-parser.ts | ~200 | AI-powered Q&A extraction |
| apps/api/src/questionnaire/utils/constants.ts | ~100 | System prompts |
| apps/api/src/trigger/questionnaire/answer-question-helpers.ts | ~200 | RAG answer generation |
| apps/api/src/vector-store/lib/core/generate-embedding.ts | ~50 | Embedding generation |
Guardrails
Answer generation prompt (constants.ts):
- Answer based ONLY on the provided context
- If insufficient evidence → "N/A - no evidence found"
- Use "we/our/us" voice for the organization
- Keep answers 1-3 sentences
- Never fabricate information not in context
Parsing fallback chain:
1. Groq (meta-llama/gpt-oss-120b) - ultra-fast, 25K char chunks
2. Claude Sonnet - 200K context for large documents
3. OpenAI gpt-4o-mini - final fallback
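The fallback chain above can be expressed as a simple first-success loop. This sketch uses invented names (`Parser`, `parseWithFallback`) and is synchronous for brevity; the real provider calls are async.

```typescript
// Sketch of the parsing fallback chain: try each provider in order and
// return the first success. Parser functions stand in for the real
// Groq → Claude → OpenAI calls, which are async in the actual code.
type Parser = (content: string) => string[];

function parseWithFallback(content: string, chain: Parser[]): string[] {
  let lastError: unknown;
  for (const parse of chain) {
    try {
      return parse(content);
    } catch (err) {
      lastError = err; // provider failed; try the next one
    }
  }
  throw lastError; // every provider failed
}
```

Ordering the chain cheapest-first means the expensive models only pay for the documents the fast ones cannot handle.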
Vector Store Implementation
```
// apps/api/src/vector-store/lib/core/generate-embedding.ts
// Uses OpenAI text-embedding-3-small (1536 dimensions)
// Stored in PostgreSQL via pgvector extension
// Sources indexed:
//   - Policies (full content, chunked)
//   - Context Q&A (manual answers from onboarding/settings)
//   - Knowledge base documents (uploaded files)
//   - Manual answer entries (human overrides for questionnaires)
// Search: cosine similarity with top-k results
// Batch support: batchGenerateEmbeddings() for efficiency
```

AI System #2: Policy Generation & Editing
Policy Generation (Trigger.dev task)
Inputs:
- Policy template (FrameworkEditorPolicyTemplate)
- Organization data (industry, size, tech stack)
- Active frameworks (SOC 2, ISO 27001, etc.)
- Context hub answers (onboarding data)
Process:
1. Build comprehensive prompt with company info
2. Call gpt-5-mini for TipTap JSON structure
3. Sanitize output (remove "<<TO REVIEW>>" placeholders)
4. Align with template structure
5. Save to database with version tracking
Key file: apps/api/src/trigger/policies/update-policy-helpers.ts
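The sanitize step (step 3) can be sketched as a small string pass. The function name and the whitespace cleanup are illustrative; only the `<<TO REVIEW>>` placeholder itself comes from the source.

```typescript
// Sketch of sanitizing generated policy text: strip the "<<TO REVIEW>>"
// placeholders the model may leave behind. The exact handling in
// update-policy-helpers.ts may differ; this shows the idea.
function sanitizeGeneratedText(text: string): string {
  return text
    .replace(/<<TO REVIEW>>/g, '')
    .replace(/[ \t]{2,}/g, ' ') // collapse doubled spaces left by removal
    .trim();
}
```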
Policy Chat Assistant (streaming)
Endpoint: POST /api/policies/[policyId]/chat
Model: claude-sonnet-4-6
Max steps: 5 (prevents runaway tool loops)
Tools available:
- getVendors: Fetch organization's vendor list
- getPolicies: Fetch other policies for cross-reference
- getEvidence: Fetch related evidence
- proposePolicy: Submit edited TipTap JSON
System prompt emphasizes:
- "PRESERVE UNCHANGED TEXT EXACTLY"
- Section boundary rules for headings/lists
- TipTap JSON structure requirements
- Prohibition on copying previous proposals
Key file: apps/app/src/app/api/policies/[policyId]/chat/route.ts
Section Editor (single-turn)
Endpoint: POST /api/policies/[policyId]/edit-section
Model: claude-sonnet-4-6
Purpose: Edit a single section without full policy context
Strips previous proposePolicy tool calls from history to prevent reuse
AI System #3: Cloud Security Remediation
Architecture
```
Security Finding (e.g., "S3 bucket public access enabled")
  │
  ▼
Phase 1: Generate Initial Fix Plan
  │  Model: claude-opus-4-6 (temperature: 0)
  │  Input: Finding description + cloud provider
  │  Output: { readSteps, fixSteps, rollbackSteps }
  │
  ▼
Phase 2: Execute Read Steps
  │  AWS SDK v3 command execution
  │  Gathers actual resource state
  │
  ▼
Phase 3: Refine Plan with Real Data
  │  Model: claude-opus-4-6 (temperature: 0)
  │  Input: Finding + actual AWS state
  │  Output: Refined { readSteps, fixSteps, rollbackSteps }
  │
  ▼
Execute Fix Steps (with acknowledgment)
  │  Maps step commands to AWS SDK calls
  │  Tracks: executing → success/failed/needs_permissions
  │
  ▼
Rollback Available (if fix fails)
```
Key Files
| File | Purpose |
|---|---|
| apps/api/src/cloud-security/ai-remediation.service.ts | Orchestrates 2-phase fix planning |
| apps/api/src/cloud-security/ai-remediation.prompt.ts | AWS fix plan Zod schema + prompts |
| apps/api/src/cloud-security/gcp-ai-remediation.prompt.ts | GCP REST API fix schemas |
| apps/api/src/cloud-security/azure-ai-remediation.prompt.ts | Azure ARM API fix schemas |
| apps/api/src/cloud-security/aws-command-executor.ts | Maps AI output to AWS SDK calls |
Why Temperature 0?
```
// ai-remediation.service.ts
const result = await generateObject({
  model: anthropic('claude-opus-4-6'),
  temperature: 0, // CRITICAL: deterministic output
  // ...
});
```

Cloud remediation generates executable commands. A creative variation in an IAM policy could:
- Grant excessive permissions (security risk)
- Block legitimate access (operational risk)
- Produce invalid JSON (execution failure)
Temperature 0 ensures reproducible, exact outputs.
Multi-Cloud Schema Design
Each cloud provider has its own Zod schema for fix plans:
AWS: Uses SDK v3 command class names
```
// FixStep for AWS
{
  service: "S3",
  command: "PutPublicAccessBlock",
  params: { Bucket: "my-bucket", ... }
}
```

GCP: Uses REST API endpoints

```
// FixStep for GCP
{
  method: "PATCH",
  url: "https://storage.googleapis.com/storage/v1/b/my-bucket",
  body: { ... }
}
```

Azure: Uses ARM REST API

```
// FixStep for Azure
{
  method: "PUT",
  url: "https://management.azure.com/subscriptions/.../providers/...",
  body: { ... }
}
```

AI System #4: Vendor Risk Assessment
Pipeline
```
Vendor created/updated
  │
  ▼
Trigger.dev: vendor-risk-assessment-task
  │
  ├── Firecrawl: Scrape vendor website (core pages)
  ├── Firecrawl: Research vendor news/incidents
  │
  ▼
LLM Analysis (gpt-5.2)
  │  Input: Website content + news + existing vendor data
  │  Output: Structured risk assessment
  │    - Security posture analysis
  │    - Risk scores (low/medium/high)
  │    - Compliance certification detection
  │    - Task generation for remediation
  │
  ▼
PostgreSQL advisory lock (prevents concurrent assessment)
  │
  ▼
Save: VendorRiskAssessment with version tracking
Create: TaskItems for follow-up actions
```
Deduplication
```
// PostgreSQL advisory locks prevent concurrent vendor assessment
// Keyed by website domain hash
// Versions: v1, v2, v3... for re-runs
```

AI System #5: Assistant Chat (API)
Architecture
Endpoint: POST /v1/assistant-chat/completions
Model: gpt-5 (OpenAI)
Streaming: Server-sent events via streamText()
Steps limit: 5 (prevents runaway)
Tools (permission-gated per user):
- findOrganization: Always available
- getUser: Always available
- getPolicies: Requires policy:read
- getPolicyContent: Requires policy:read
- getRisks: Requires risk:read
- getRiskById: Requires risk:read
History: Ephemeral, stored in Upstash Redis
Permission-Gated Tool Pattern
```
// apps/api/src/assistant-chat/assistant-chat-tools.ts
function buildTools(permissions: UserPermissions) {
  const tools = {
    findOrganization: { ... }, // Always available
    getUser: { ... },          // Always available
  };

  if (hasPermission(permissions, 'policy', 'read')) {
    tools.getPolicies = { ... };
    tools.getPolicyContent = { ... };
  }

  if (hasPermission(permissions, 'risk', 'read')) {
    tools.getRisks = { ... };
    tools.getRiskById = { ... };
  }

  return tools;
}
```

This ensures the LLM cannot access data the user doesn't have permission to see, even through tool calls.
AI System #6: Browser Automation
Stack
```
Browserbase (cloud browser infrastructure)
└── Stagehand v3 (AI browser agent)
    └── Claude Sonnet 4.6 (visual understanding)
        └── Playwright (browser protocol)
```
How It Works
```
// apps/api/src/browserbase/browserbase.service.ts

// 1. Create/reuse persistent browser context per org
const contextId = await getOrCreateOrgContext(orgId);

// 2. Create session with context
const session = await browserbase.sessions.create({
  projectId: BROWSERBASE_PROJECT_ID,
  browserSettings: { context: { id: contextId } }
});

// 3. Initialize Stagehand with Claude
const stagehand = new Stagehand({
  browserbaseSessionID: session.id,
  modelName: 'anthropic/claude-sonnet-4-6',
  modelClientOptions: { apiKey: ANTHROPIC_API_KEY }
});

// 4. Execute natural-language tasks (max 20 steps)
await stagehand.agent.execute(taskInstructions);

// 5. Capture screenshots → upload to S3 → return presigned URLs
```

Use Cases
- Evidence collection from SaaS dashboards
- Automated compliance checks that require browser interaction
- Login verification and status checks
AI System #7: Auditor Content Generation
Model: gpt-5.2
Trigger: Trigger.dev task
Generates sections:
- Company background
- Services provided
- Mission & vision
- System description
- Critical vendors (filtered for SOC 2 relevance)
- Subservice organizations
Data sources:
- Organization context hub answers
- Website scraping (if URL available)
Guardrails:
- "NEVER mention missing information"
- "Write about what IS available"
- "No hedging words (may, might, likely)"
- "No attribution phrases"
AI System #8: Task Automation Chat
Frontend Architecture
React component: chat.tsx
Framework: @ai-sdk/react useChat() hook
Transport: DefaultChatTransport → /api/tasks-automations/chat
Features:
- Streaming with visible reasoning steps
- Dynamic model selection via AI Gateway
- Ephemeral → persistent automation transition
- Tools: web search (Exa), website crawling (Firecrawl)
- Secret injection and info context provision
Model Gateway
```
// apps/app/src/.../tools/gateway.ts
// AI Gateway allows runtime model selection
// User can choose model + reasoning effort
// Reasoning effort: minimal | low | medium
```

Guardrails & Safety Patterns
1. Step Limiting
```
// Prevents runaway tool calling loops
streamText({
  maxSteps: 5,
  // or
  stopCondition: stepCountIs(5),
});
```

Used in: policy chat, assistant chat, section editor
2. Temperature Control
```
// Deterministic outputs for safety-critical operations
temperature: 0    // Cloud remediation (IAM policies, CLI commands)
temperature: auto // Creative tasks (policy writing, chat)
```

3. Zod Schema Validation
```
// All structured LLM outputs validated before use
const result = await generateObject({
  schema: fixPlanSchema, // Zod schema
  // ...
});
// Invalid outputs throw NoObjectGeneratedError → fallback handling
```

4. Context Grounding (RAG)
"Answer based ONLY on the provided context"
"If insufficient → respond 'N/A - no evidence found'"
Prevents hallucination in questionnaire answers and SOA responses.
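The grounding guardrail can be enforced before the LLM is even called: if vector search returns nothing usable, short-circuit with the mandated "N/A" answer. This sketch is illustrative; the function name and similarity threshold are assumptions, not values from the source.

```typescript
// Sketch of the grounding guardrail: with no context above the
// similarity threshold, return the mandated N/A answer instead of
// calling the model. Threshold value is illustrative.
interface Hit { text: string; similarity: number; }

function groundedAnswerOrNA(hits: Hit[], minSimilarity = 0.75): string | null {
  const usable = hits.filter(h => h.similarity >= minSimilarity);
  if (usable.length === 0) return 'N/A - no evidence found';
  return null; // null → proceed to the LLM with `usable` as context
}
```

Gating before the call both prevents hallucination and saves the token cost of a doomed request.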
5. Content Truncation
```
// Groq: 25K char chunks (32K context limit)
// General parsing: 80K char chunks
// Vision models: document slicing at 80K chars
```

6. Permission-Gated Tools
```
// Assistant chat tools filtered by user permissions
// LLM can only call tools the user has access to
```

7. Fallback Chains
Groq (fast/cheap) → Claude (large context) → OpenAI (reliable)
Resilient parsing even when primary provider is down.
8. Error Handling
```
// NoObjectGeneratedError: Special handling with JSON.parse() fallback
// Missing API keys: Returns 503 (not 500)
// Browserbase failures: Actionable error messages
// Vendor assessment: 2-attempt retry with advisory locks
```

Cost Optimization Patterns
1. Batch Operations
```
// Instead of N embedding calls, batch them
batchGenerateEmbeddings(texts);  // Uses embedMany()
generateAnswerWithRAGBatch();    // Pre-fetch vectors + parallel LLM
batchSearchSOAQuestions();       // Pre-fetch all control vectors
```

2. Model Tiering
Simple classification → Groq Llama (cents)
Structured extraction → GPT-4o-mini (pennies)
Complex reasoning → Claude Sonnet (dimes)
Safety-critical → Claude Opus (dollars)
3. Streaming Responses
Policy chat and assistant use streamText() for:
- Better perceived latency (first token fast)
- User can cancel early (saves tokens)
- Progressive rendering
4. Chunking Strategy
Small docs: Single LLM call
Large docs (>25K): Chunk and process in parallel
Huge docs (>80K): Use Claude 200K context as fallback
Images/PDFs: Vision models for extraction
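The character-based chunking behind this strategy is a plain fixed-size split. The limits match the numbers quoted in the guardrails section (25K for Groq, 80K for general parsing); the function itself is an illustrative sketch, not the repo's implementation.

```typescript
// Fixed-size character chunking used before sending large documents
// to a size-limited model. Real chunkers may also split on sentence
// or section boundaries; this shows the basic mechanism.
function chunkForModel(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks.length > 0 ? chunks : [''];
}
```

A 60K-character document at the 25K Groq limit yields three chunks that can be parsed in parallel; anything over 80K falls back to Claude's 200K context instead.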
Prompt Engineering Patterns
Role Establishment
"You are an expert in GRC (Governance, Risk, and Compliance)"
"You are a helpful assistant in Comp AI"
Structured Output Instructions
"Return a JSON object with the following structure..."
"Use TipTap JSON format for policy content"
"Generate AWS SDK v3 command names, not CLI commands"
Negative Instructions (What NOT to Do)
"NEVER mention missing information"
"Do NOT use general knowledge"
"NEVER hallucinate data not in context"
"Do NOT copy previous policy proposals"
Voice & Tone Control
"Use 'we/our/us' voice for the organization"
"No hedging words (may, might, likely)"
"Keep answers 1-3 sentences"
Context Injection
"Current date: {date}"
"Organization: {name} in {industry}"
"Active frameworks: {frameworks}"
"Company size: {size} employees"
What's Notable
Strengths
- Model diversity - Not locked to one provider; uses the right model for each task
- RAG is core, not afterthought - Vector store deeply integrated with compliance workflow
- Permission-aware AI - Tools respect RBAC, preventing data leakage through AI
- Deterministic where it matters - Temperature 0 for cloud remediation prevents dangerous creative outputs
- Fallback chains - Graceful degradation across LLM providers
- Audit trail includes AI - Generated answers tracked separately from manual ones
Potential Improvements
- No token counting or budget enforcement - No visible per-org cost tracking
- No content filtering layer - Relies on model-level safety, no explicit input/output filtering
- No prompt injection defense - User-uploaded documents feed directly into prompts
- Embedding model is basic - text-embedding-3-small may miss nuance; no reranking step
- No A/B testing of models - Model selection is hardcoded, not experimentally validated
- No caching of LLM responses - Repeated identical queries hit the API each time