# 05 - AI & LLM Architecture

## Models
| Model | Provider | Usage |
|---|---|---|
| `gpt-5` | OpenAI | AI chat assistant, questionnaire answering, vendor assessment |
| `text-embedding-3-small` | OpenAI | Vector embeddings for semantic search |
| Anthropic models | Anthropic | Alternative LLM provider (via AI SDK) |
| Groq models | Groq | Alternative LLM provider (via AI SDK) |
The specific model used for questionnaire answering is configured via the `ANSWER_MODEL` constant in `apps/api/src/questionnaire/utils/constants.ts`.
## Framework: Vercel AI SDK

All LLM interactions use the Vercel AI SDK (the `ai` package) for a unified multi-provider abstraction:
```ts
import { openai } from '@ai-sdk/openai';
import { streamText, generateText, embed, embedMany } from 'ai';
```

Key functions used:

- `streamText()` — Streaming chat responses with tool calling
- `generateText()` — Single-shot text generation (RAG answers, policy content)
- `embed()` — Single embedding generation
- `embedMany()` — Batch embedding generation
Provider packages: `@ai-sdk/openai`, `@ai-sdk/anthropic`, `@ai-sdk/groq`
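Because every provider sits behind the same interface, swapping models is a one-line change. A minimal sketch (the `USE_ANTHROPIC` flag and the Anthropic model ID are illustrative, not from the codebase):

```ts
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { generateText } from 'ai';

// Pick a provider; the generateText() call itself is provider-agnostic.
const model = process.env.USE_ANTHROPIC
  ? anthropic('claude-sonnet-4-5') // model ID illustrative
  : openai('gpt-5');

const { text } = await generateText({
  model,
  prompt: 'Summarize our access control policy in two sentences.',
});
```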
## RAG Pipeline (Retrieval-Augmented Generation)

File: `apps/api/src/trigger/questionnaire/answer-question-helpers.ts`
The RAG pipeline answers security questionnaire questions using organization-specific data:
```
Question input
│
├─ 1. Generate embedding (text-embedding-3-small)
│
├─ 2. Vector search (Upstash Vector)
│     → Filter: organizationId
│     → Top 100 results
│     → Min similarity: 0.2
│     → Sources: policies, context Q&A, knowledge base docs, manual answers
│
├─ 3. Deduplicate sources
│     → Group by sourceType + sourceId
│     → Keep highest-scoring per source
│
├─ 4. Build context string
│     → "[1] Source: Policy 'Access Control'\n<content>"
│     → "[2] Source: Context Q&A\n<content>"
│     → ...
│
├─ 5. LLM generation
│     → System prompt: ANSWER_SYSTEM_PROMPT
│     → User prompt: "Based on the following context... Answer ONLY from provided context"
│     → First person plural (we, our, us)
│
└─ 6. Post-processing
      → Detect "N/A - no evidence found" → return null
      → Return answer + attributed sources with scores
```
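Steps 1–2 map onto the `@upstash/vector` SDK roughly as follows (the env-based index setup, import path, and exact filter string are assumptions):

```ts
import { Index } from '@upstash/vector';
import { generateEmbedding } from './generate-embedding'; // path illustrative

const index = Index.fromEnv(); // reads UPSTASH_VECTOR_REST_URL / _TOKEN

// Embed the question, then run an org-scoped similarity search.
async function searchOrgContext(question: string, organizationId: string) {
  const vector = await generateEmbedding(question);
  const results = await index.query({
    vector,
    topK: 100, // step 2: top 100 results
    includeMetadata: true,
    filter: `organizationId = '${organizationId}'`, // server-side scoping
  });
  return results.filter((r) => r.score >= 0.2); // drop low-similarity noise
}
```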
Batch optimization: `generateAnswerWithRAGBatch()` processes multiple questions (sketched below) by:

- Generating all embeddings in a single `embedMany()` call
- Running vector searches in parallel (`Promise.all`)
- Running LLM generations in parallel
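A sketch of that batching shape; the search and context-building helpers are passed in as parameters only to keep the example self-contained, and the real logic lives in `generateAnswerWithRAGBatch()`:

```ts
import { openai } from '@ai-sdk/openai';
import { embedMany, generateText } from 'ai';

async function answerBatch(
  questions: string[],
  organizationId: string,
  search: (vector: number[], orgId: string) => Promise<unknown[]>, // steps 2–3
  buildContext: (results: unknown[]) => string, // step 4
  systemPrompt: string, // ANSWER_SYSTEM_PROMPT in the real code
) {
  // 1. One embedMany() call covers every question.
  const { embeddings } = await embedMany({
    model: openai.embedding('text-embedding-3-small'),
    values: questions,
  });

  // 2. Vector searches run in parallel.
  const contexts = await Promise.all(
    embeddings.map((vector) => search(vector, organizationId)),
  );

  // 3. LLM generations run in parallel as well.
  return Promise.all(
    questions.map((question, i) =>
      generateText({
        model: openai('gpt-5'), // the real model comes from ANSWER_MODEL
        system: systemPrompt,
        prompt: `${buildContext(contexts[i])}\n\nQuestion: ${question}`,
      }),
    ),
  );
}
```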
## Embedding Generation

File: `apps/api/src/vector-store/lib/core/generate-embedding.ts`
Two functions for embedding generation:
```ts
import { openai } from '@ai-sdk/openai';
import { embed, embedMany } from 'ai';

// Single embedding
export async function generateEmbedding(text: string): Promise<number[]> {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: text,
  });
  return embedding;
}

// Batch embeddings (single API call for N texts)
export async function batchGenerateEmbeddings(texts: string[]): Promise<number[][]> {
  const { embeddings } = await embedMany({
    model: openai.embedding('text-embedding-3-small'),
    values: texts,
  });
  return embeddings;
}
```

Embeddings are stored in Upstash Vector with metadata:

- `organizationId` — for server-side filtering
- `sourceType` — `policy`, `context`, `knowledge_base_document`, `manual_answer`, or `attachment`
- `sourceId` — reference to the source record
- `content` — the embedded text chunk
- `policyName`, `contextQuestion`, `documentName` — human-readable labels
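Writing a chunk with that metadata might look like the following sketch (the ID scheme and function shape are assumptions):

```ts
import { Index } from '@upstash/vector';

const index = Index.fromEnv();

// Store one embedded policy chunk alongside the metadata fields listed above.
async function upsertPolicyChunk(
  policy: { id: string; organizationId: string; name: string },
  chunkIndex: number,
  chunkText: string,
  embedding: number[],
) {
  await index.upsert([
    {
      id: `policy:${policy.id}:chunk-${chunkIndex}`, // ID scheme is an assumption
      vector: embedding,
      metadata: {
        organizationId: policy.organizationId, // server-side filtering key
        sourceType: 'policy',
        sourceId: policy.id,
        content: chunkText,
        policyName: policy.name,
      },
    },
  ]);
}
```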
## Vector Store Sync

Directory: `apps/api/src/vector-store/lib/sync/`
The vector store is kept in sync with source data via incremental sync pipelines:
| Sync Pipeline | Source Data | Trigger |
|---|---|---|
| `sync-policies.ts` | Policy documents (TipTap JSON → text extraction) | Org sync, policy update |
| `sync-context.ts` | Organization context Q&A pairs | Org sync, context update |
| `sync-knowledge-base.ts` | Uploaded knowledge base documents | Document upload |
| `sync-manual-answer.ts` | Manual questionnaire answers | Answer creation/update |
`syncOrganizationEmbeddings()` orchestrates a full incremental sync for an organization, running before questionnaire answering to ensure fresh data.

Text is chunked via `chunk-text.ts` before embedding (see the sketch below), and policy text is extracted from TipTap JSON via `extract-policy-text.ts`.
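The real chunker isn't shown in this section; a minimal fixed-size chunker with overlap illustrates the idea (the 1,000-character size and 200-character overlap are assumptions, not the actual parameters):

```ts
// Illustrative only: the real chunk size and overlap live in chunk-text.ts.
export function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = Math.max(1, size - overlap); // guard against a non-advancing loop
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```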
## AI Chat Assistant

File: `apps/app/src/app/api/chat/route.ts`
The chat endpoint provides a streaming AI assistant for compliance guidance:
```ts
const result = streamText({
  model: openai('gpt-5'),
  system: systemPrompt,
  messages: convertToModelMessages(messages),
  tools,
});
return result.toUIMessageStreamResponse();
```

The system prompt includes:

- GRC expert persona
- Organization context (`organizationId`)
- Current date/time
- An instruction to prefer tool calls over guessing
- Markdown formatting constraints
Tools (`apps/app/src/data/tools/`):

| Tool | File | Purpose |
|---|---|---|
| Organization | `organization.ts` | Fetch org details, members, settings |
| Policies | `policies.ts` | Query policy documents and status |
| Risks | `risks-tool.ts` | Retrieve risk register entries |
| User | `user.ts` | Get current user info and permissions |
These tools let the LLM fetch live organization data during conversations. The response is streamed to the client via `toUIMessageStreamResponse()`.
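Each tool pairs an input schema with an execute function. A sketch of what one definition might look like, assuming the AI SDK v5 `tool()` helper (the schema, statuses, and placeholder return value are assumptions, not the actual `policies.ts` code):

```ts
import { tool } from 'ai';
import { z } from 'zod';

// Hypothetical shape of a chat tool; the real implementations live in
// apps/app/src/data/tools/.
export const getPolicies = tool({
  description: "List the organization's policy documents and their status",
  inputSchema: z.object({
    status: z.enum(['draft', 'published']).optional(), // assumed statuses
  }),
  execute: async ({ status }) => {
    // In the real tool this queries the database scoped to the current
    // organization; a placeholder keeps the sketch self-contained.
    return { policies: [], status };
  },
});
```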
Auth: Session-based (Better Auth cookies). Validates user membership in the specified organization before processing.
## Policy Generation

File: `apps/app/src/trigger/lib/prompts.ts`
Policies are generated by editing TipTap JSON templates with LLM assistance:
```
Template (TipTap JSON with placeholders)
  + Company info (name, website)
  + Framework context (SOC2, HIPAA flags)
  + Knowledge base Q&A (Context Hub)
  → LLM generates final TipTap JSON
```
Placeholder system:

| Placeholder | Maps to |
|---|---|
| `{{COMPANY}}` | Company name |
| `{{COMPANYINFO}}` | Company description |
| `{{INDUSTRY}}` | Industry sector |
| `{{EMPLOYEES}}` | Employee count |
| `{{DEVICES}}` | Team device types |
| `{{SOFTWARE}}` | Software used |
| `{{LOCATION}}` | Work arrangement |
| `{{CRITICAL}}` | Hosting/infrastructure |
| `{{DATA}}` | Data types handled |
| `{{GEO}}` | Data location |
Handlebars-style conditionals:

```handlebars
{{#if soc2}}
SOC 2-specific policy content...
{{/if}}

{{#if hipaa}}
HIPAA-specific policy content...
{{/if}}
```
The LLM evaluates these conditionals based on the organization's selected frameworks and removes unmatched blocks.
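Putting the pieces together, the generation call likely resembles the following sketch (the prompt wording, system instruction, and function shape are all assumptions; the real prompts live in `apps/app/src/trigger/lib/prompts.ts`):

```ts
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

// Hypothetical assembly of template + company info + framework flags.
async function generatePolicy(
  templateJson: string,
  company: { name: string; website: string },
  frameworks: string[], // e.g. ['soc2', 'hipaa'] — drives the {{#if ...}} blocks
) {
  const { text } = await generateText({
    model: openai('gpt-5'),
    system:
      'You fill in TipTap JSON policy templates. Resolve all placeholders and ' +
      '{{#if}} conditionals, and return valid TipTap JSON only.',
    prompt: [
      `Template:\n${templateJson}`,
      `Company: ${company.name} (${company.website})`,
      `Selected frameworks: ${frameworks.join(', ')}`,
    ].join('\n\n'),
  });
  return text; // parsed downstream as TipTap JSON
}
```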
## Vendor Risk Assessment
Vendor risk assessment combines web scraping with structured LLM output:

1. **Research** (`apps/app/src/trigger/lib/research.ts`):
   - Submits the vendor website URL to the Firecrawl API (`/v1/extract`)
   - Polls for completion (5-second intervals, 5-minute timeout)
   - Validates extracted data against a Zod schema
   - Options for `onlyMainContent` and `removeBase64Images`
2. **Assessment** (`apps/app/src/trigger/tasks/onboarding/generate-vendor-mitigation.ts`):
   - The LLM generates a structured risk assessment using a Zod output schema (sketched below)
   - Includes risk scoring, categorization, and mitigation recommendations
   - Uses `pg_advisory_lock` for concurrent write safety
   - Saves the assessment to the database
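The structured-output step likely resembles the AI SDK's `generateObject()` with a Zod schema; the field names and ranges below are assumptions, not the schema from `generate-vendor-mitigation.ts`:

```ts
import { openai } from '@ai-sdk/openai';
import { generateObject } from 'ai';
import { z } from 'zod';

// Assumed schema shape; the real one lives in generate-vendor-mitigation.ts.
const vendorAssessmentSchema = z.object({
  riskScore: z.number().min(1).max(10),
  riskCategory: z.enum(['low', 'medium', 'high']),
  mitigations: z.array(z.string()),
});

async function assessVendor(research: string) {
  const { object } = await generateObject({
    model: openai('gpt-5'),
    schema: vendorAssessmentSchema,
    prompt: `Assess the security risk of this vendor based on the research below:\n\n${research}`,
  });
  return object; // validated against the schema before it reaches the database
}
```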
## Guardrails

### Context Bounding
The RAG pipeline instructs the LLM to "answer ONLY from the provided context" and use first person plural (we, our, us). If context is insufficient, the LLM must respond with "N/A - no evidence found".
### Output Validation
- Zod schemas validate structured LLM output (vendor assessments, research extraction)
- Firecrawl responses are validated at both the job status level and the final data level
### Source Attribution

Every RAG answer includes attributed sources with similarity scores. Sources are deduplicated by `sourceType` + `sourceId`, keeping the highest-scoring entry per source.
### No-Evidence Detection

```ts
function isNoEvidenceAnswer(answer: string): boolean {
  const lowerAnswer = answer.toLowerCase();
  return (
    lowerAnswer.includes('n/a') ||
    lowerAnswer.includes('no evidence') ||
    lowerAnswer.includes('not found in the context')
  );
}
```

Answers detected as "no evidence" are returned as `null` with empty sources, preventing hallucinated responses from reaching users.
### Rate Limiting

The NestJS API applies global rate limiting via `ThrottlerModule`:

- 100 requests per 60 seconds per IP address
- Applied as a global `APP_GUARD`
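With `@nestjs/throttler` v5+, that configuration looks roughly like the following (the module layout is a sketch, not the API's actual app module):

```ts
import { Module } from '@nestjs/common';
import { APP_GUARD } from '@nestjs/core';
import { ThrottlerGuard, ThrottlerModule } from '@nestjs/throttler';

@Module({
  imports: [
    // 100 requests per 60 seconds per IP (ttl is in milliseconds in v5+).
    ThrottlerModule.forRoot([{ ttl: 60_000, limit: 100 }]),
  ],
  providers: [
    // Registering ThrottlerGuard as APP_GUARD applies it to every route.
    { provide: APP_GUARD, useClass: ThrottlerGuard },
  ],
})
export class AppModule {}
```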
### Similarity Threshold
Vector search uses a minimum similarity score of 0.2 to filter noise while maintaining high recall. Results below this threshold are discarded before context building.
### Request Limits

- Chat endpoint: `maxDuration = 30` seconds
- Body parser: 150 MB limit (for base64 file uploads)
- Firecrawl polling: 5-minute timeout with 5-second intervals