CodeDocs Vault

Benchmark Guide

This guide explains how to use the AIGovHub benchmark dataset to evaluate detection accuracy.

Overview

The benchmark dataset allows you to:

  - Measure detection accuracy against repositories with known ground-truth labels
  - Compare precision, recall, and F1 score across runs and configurations
  - Identify false positives and false negatives to guide detection improvements
  - Track accuracy over time as the detection logic evolves

Benchmark Dataset

Location

benchmark/
├── repos.yaml          # Repository manifest with labels
├── cached/             # Cloned repos (gitignored)
├── results/            # Evaluation results
└── analysis/           # Analysis notebooks

Dataset Format

# benchmark/repos.yaml
schema_version: "1.0.0"
last_updated: "2025-01-15"
description: "Curated dataset for evaluating AI detection accuracy"
 
targets:
  accuracy: 0.90
  precision: 0.85
  recall: 0.95
  f1_score: 0.90
 
repositories:
  # Known AI repository
  - url: "https://github.com/huggingface/transformers"
    name: "transformers"
    labels:
      contains_ai: true
      ai_type: ["deep_learning", "nlp"]
      sector: "general_ml"
      risk_category: "minimal_risk"
      frameworks: ["pytorch", "tensorflow"]
    notes: "HuggingFace Transformers library"
 
  # Known non-AI repository
  - url: "https://github.com/pallets/flask"
    name: "flask"
    labels:
      contains_ai: false
      sector: "web_development"
      risk_category: null
    notes: "Python web framework - no AI components"

Label Fields

Field           Type       Description
-------------   --------   ----------------------------------------
contains_ai     boolean    Ground truth: does repo contain AI?
ai_type         string[]   Types of AI present
sector          string     Application domain
risk_category   string     EU AI Act risk category
frameworks      string[]   AI frameworks used
notes           string     Human notes

Running Benchmarks

Basic Benchmark

# Run benchmark with default dataset
aigovhub benchmark
 
# Use custom dataset
aigovhub benchmark --dataset my-repos.yaml
 
# Output results to specific directory
aigovhub benchmark --output results/run-001/

Manual Benchmark Process

Until the benchmark command is fully implemented, you can run benchmarks manually:

#!/bin/bash
# benchmark_run.sh
 
REPOS_FILE="benchmark/repos.yaml"
RESULTS_DIR="benchmark/results/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RESULTS_DIR"
 
# Parse repo names and URLs from the YAML manifest and scan each
# (requires yq; the expression below assumes mikefarah yq v4)
yq '.repositories[] | .name + " " + .url' "$REPOS_FILE" |
while read -r name url; do
    echo "Scanning $name..."

    # Clone if not cached
    if [ ! -d "benchmark/cached/$name" ]; then
        git clone "$url" "benchmark/cached/$name"
    fi

    # Scan without LLM enrichment for reproducible results
    aigovhub scan "benchmark/cached/$name" \
        --output "$RESULTS_DIR/$name.yaml" \
        --no-llm
done
 
# Evaluate results
python scripts/evaluate_benchmark.py "$RESULTS_DIR"

Python Benchmark Script

"""benchmark_evaluate.py - Evaluate benchmark results."""
 
import yaml
from pathlib import Path
from dataclasses import dataclass
 
@dataclass
class BenchmarkResult:
    true_positives: int = 0
    true_negatives: int = 0
    false_positives: int = 0
    false_negatives: int = 0
 
    @property
    def accuracy(self) -> float:
        total = (self.true_positives + self.true_negatives
                 + self.false_positives + self.false_negatives)
        return (self.true_positives + self.true_negatives) / total if total else 0.0

    @property
    def precision(self) -> float:
        predicted_positive = self.true_positives + self.false_positives
        return self.true_positives / predicted_positive if predicted_positive else 0.0

    @property
    def recall(self) -> float:
        actual_positive = self.true_positives + self.false_negatives
        return self.true_positives / actual_positive if actual_positive else 0.0

    @property
    def f1_score(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0
 
def evaluate(repos_file: Path, results_dir: Path) -> BenchmarkResult:
    """Evaluate benchmark results against ground truth."""
    result = BenchmarkResult()
 
    # Load ground truth
    with open(repos_file) as f:
        dataset = yaml.safe_load(f)
 
    for repo in dataset["repositories"]:
        name = repo["name"]
        expected_ai = repo["labels"]["contains_ai"]
 
        # Load scan result
        result_file = results_dir / f"{name}.yaml"
        if not result_file.exists():
            print(f"Warning: No result for {name}")
            continue
 
        with open(result_file) as f:
            scan_result = yaml.safe_load(f)
 
        detected_ai = len(scan_result.get("ai_systems", [])) > 0
 
        # Update metrics
        if expected_ai and detected_ai:
            result.true_positives += 1
        elif not expected_ai and not detected_ai:
            result.true_negatives += 1
        elif not expected_ai and detected_ai:
            result.false_positives += 1
            print(f"False positive: {name}")
        else:  # expected_ai and not detected_ai
            result.false_negatives += 1
            print(f"False negative: {name}")
 
    return result
 
if __name__ == "__main__":
    import sys
    results_dir = Path(sys.argv[1])
    result = evaluate(Path("benchmark/repos.yaml"), results_dir)
 
    print(f"\nBenchmark Results:")
    print(f"  Accuracy:  {result.accuracy:.1%}")
    print(f"  Precision: {result.precision:.1%}")
    print(f"  Recall:    {result.recall:.1%}")
    print(f"  F1 Score:  {result.f1_score:.1%}")

Metrics

Definitions

Metric      Formula                  Description
---------   ----------------------   -------------------------------------
Accuracy    (TP + TN) / Total        Overall correctness
Precision   TP / (TP + FP)           How many detections are correct
Recall      TP / (TP + FN)           How many AI systems are found
F1 Score    2 × (P × R) / (P + R)    Harmonic mean of precision and recall
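
As a worked example, here is a quick calculation with hypothetical counts (the numbers are made up for illustration, chosen to land near the targets below):

# Hypothetical run: 19 AI repos (18 detected, 1 missed), 11 non-AI repos (9 passed, 2 flagged)
tp, fn, tn, fp = 18, 1, 9, 2

accuracy  = (tp + tn) / (tp + tn + fp + fn)               # 27 / 30 = 0.900
precision = tp / (tp + fp)                                # 18 / 20 = 0.900
recall    = tp / (tp + fn)                                # 18 / 19 ≈ 0.947
f1        = 2 * precision * recall / (precision + recall) # ≈ 0.923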

Confusion Matrix

                    Predicted
                    AI    Non-AI
Actual  AI          TP      FN
        Non-AI      FP      TN

Where:

  TP (true positive):  the repo contains AI and the scan detected AI
  FN (false negative): the repo contains AI but the scan detected none
  FP (false positive): the repo contains no AI but the scan detected AI
  TN (true negative):  the repo contains no AI and the scan detected none

Target Metrics

Metric      Target   Rationale
---------   ------   -------------------------------
Accuracy    >90%     General correctness
Precision   >85%     Minimize false alarms
Recall      >95%     Critical: don't miss AI systems
F1 Score    >90%     Balanced performance

Note: High recall is prioritized because missing an AI system (false negative) has greater compliance implications than a false positive.
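
The targets block in repos.yaml can be checked programmatically. A minimal sketch, reusing the BenchmarkResult class from the evaluation script above (the check_targets helper is not an existing AIGovHub function):

import yaml
from pathlib import Path

def check_targets(result, repos_file: Path = Path("benchmark/repos.yaml")) -> bool:
    """Compare a BenchmarkResult against the targets block in repos.yaml."""
    targets = yaml.safe_load(repos_file.read_text())["targets"]
    observed = {
        "accuracy": result.accuracy,
        "precision": result.precision,
        "recall": result.recall,
        "f1_score": result.f1_score,
    }
    all_passed = True
    for metric, value in observed.items():
        passed = value >= targets[metric]
        all_passed = all_passed and passed
        print(f"  {metric:<10} {value:.1%} (target {targets[metric]:.0%}) "
              f"{'PASS' if passed else 'FAIL'}")
    return all_passed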

Curating the Dataset

Repository Selection Criteria

  1. Diversity: Cover multiple AI types and sectors
  2. Clarity: Clear AI vs non-AI distinction
  3. Size: Meaningful codebases (not trivial demos)
  4. Activity: Recently maintained
  5. License: Allows analysis

Finding AI Repositories

GitHub search queries:

# Deep learning
topic:deep-learning stars:>100

# NLP
topic:nlp topic:transformers stars:>50

# Computer vision
topic:computer-vision topic:pytorch stars:>50

# LLM integrations
topic:langchain OR topic:llm stars:>50

# Sector-specific
topic:healthcare topic:ml
topic:fintech topic:ai

Finding Non-AI Repositories

# Web frameworks
topic:web-framework -topic:ai -topic:ml stars:>1000

# Developer tools
topic:cli -topic:ai stars:>500

# Libraries
topic:library -topic:machine-learning stars:>500
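
Candidate discovery can also be scripted against the public GitHub search API, which accepts the same query syntax. A minimal sketch (unauthenticated requests are heavily rate-limited, so pass an auth token for anything beyond a handful of queries):

"""find_candidates.py - sketch: list candidate repos via the GitHub search API."""

import requests

def search_repos(query: str, per_page: int = 10) -> list[str]:
    """Return 'org/name' strings for the top results of a GitHub search query."""
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "sort": "stars", "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]

if __name__ == "__main__":
    for full_name in search_repos("topic:deep-learning stars:>100"):
        print(full_name)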

Labeling Process

For each repository:

  1. Clone and inspect the codebase
  2. Check dependencies for AI libraries
  3. Look for model files (.pt, .onnx, etc.)
  4. Review documentation for AI mentions
  5. Assign labels based on evidence

Then add the entry to repos.yaml using the template below; a small script to help with steps 2 and 3 follows the template.

# Template for new repository
- url: "https://github.com/org/repo"
  name: "repo"
  labels:
    contains_ai: true|false
    ai_type: []              # If contains_ai: true
    sector: ""
    risk_category: null      # or specific category
    frameworks: []
  notes: "Brief description"

Analyzing Results

Identifying Patterns

After running benchmarks, analyze failures:

# Find common patterns in false negatives
# (false_negatives: the repos.yaml entries the scan missed, collected during evaluation)
for repo in false_negatives:
    print(f"Missed: {repo['name']}")
    print(f"  AI Types: {repo['labels']['ai_type']}")
    print(f"  Frameworks: {repo['labels']['frameworks']}")

Questions to ask:

  - Are certain AI types or frameworks consistently missed?
  - Do false positives cluster in a particular sector or share a common dependency?
  - Are the misses library-based, code-pattern-based, or both?
  - Would a different confidence threshold have changed the outcome?

Improving Detection

Based on analysis:

  1. Add libraries to constants.py if frameworks are missed (see the sketch after this list)
  2. Add patterns to code detection if patterns are missed
  3. Adjust thresholds if confidence is miscalibrated
  4. Add test cases for identified edge cases
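
As an illustration of step 1, the change is often no more than extending a framework list. The constant name and file layout below are assumptions for illustration, not the actual contents of constants.py:

# src/aigovhub/core/constants.py (hypothetical excerpt)
# Packages whose presence in a dependency manifest counts as an AI signal
AI_FRAMEWORK_PACKAGES = {
    "torch",
    "tensorflow",
    "jax",
    "transformers",
    "langchain",   # added after a benchmark false negative
}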

Tracking Progress

Store benchmark results over time:

# benchmark/results/summary.yaml
runs:
  - date: "2025-01-15"
    version: "0.1.0"
    config:
      threshold: 0.7
      llm: false
    results:
      accuracy: 0.92
      precision: 0.88
      recall: 0.96
      f1: 0.92
    notes: "Added transformers detection"
 
  - date: "2025-01-20"
    version: "0.1.1"
    # ...
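
Appending a run entry can be automated at the end of an evaluation. A minimal sketch, reusing the BenchmarkResult from the evaluation script (the append_run helper is not an existing AIGovHub function):

"""append_run.py - sketch: append a benchmark run to benchmark/results/summary.yaml."""

import yaml
from datetime import date
from pathlib import Path

def append_run(result, version: str, notes: str = "",
               summary_file: Path = Path("benchmark/results/summary.yaml")) -> None:
    """Append one run entry; `result` is a BenchmarkResult from the evaluation script."""
    summary = {"runs": []}
    if summary_file.exists():
        summary = yaml.safe_load(summary_file.read_text()) or {"runs": []}

    summary["runs"].append({
        "date": date.today().isoformat(),
        "version": version,
        "results": {
            "accuracy": round(result.accuracy, 3),
            "precision": round(result.precision, 3),
            "recall": round(result.recall, 3),
            "f1": round(result.f1_score, 3),
        },
        "notes": notes,
    })
    summary_file.parent.mkdir(parents=True, exist_ok=True)
    summary_file.write_text(yaml.safe_dump(summary, sort_keys=False))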

Best Practices

  1. Start small: Begin with 20-30 well-labeled repos
  2. Balance classes: Roughly equal AI and non-AI repos
  3. Cover sectors: Include high-risk domains
  4. Document decisions: Record why repos were labeled
  5. Version the dataset: Track changes to labels
  6. Regular updates: Re-evaluate as detection improves
  7. Human review: Spot-check results periodically

Example Benchmark Run

# 1. Ensure dataset is up to date
cat benchmark/repos.yaml | head -20
 
# 2. Clone repositories (first time only)
./scripts/clone_benchmark_repos.sh
 
# 3. Run benchmark
aigovhub benchmark --output benchmark/results/run-001/
 
# 4. Review results
cat benchmark/results/run-001/summary.json
 
# 5. Analyze failures
python scripts/analyze_failures.py benchmark/results/run-001/
 
# 6. Update detection if needed
# Edit src/aigovhub/core/constants.py
# Add new signal detectors
 
# 7. Re-run and compare
aigovhub benchmark --output benchmark/results/run-002/
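
Step 7's comparison can reuse the evaluate() helper from the evaluation script. A minimal sketch, assuming that script is importable (for example, on PYTHONPATH):

"""compare_runs.py - sketch: compare two benchmark runs side by side."""

import sys
from pathlib import Path

from evaluate_benchmark import evaluate  # the evaluation script shown earlier

repos = Path("benchmark/repos.yaml")
before = evaluate(repos, Path(sys.argv[1]))   # e.g. benchmark/results/run-001
after = evaluate(repos, Path(sys.argv[2]))    # e.g. benchmark/results/run-002

for metric in ("accuracy", "precision", "recall", "f1_score"):
    old, new = getattr(before, metric), getattr(after, metric)
    print(f"{metric:<10} {old:.1%} -> {new:.1%} ({new - old:+.1%})")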