# Benchmark Guide
This guide explains how to use the AIGovHub benchmark dataset to evaluate detection accuracy.
## Overview
The benchmark dataset allows you to:
- Evaluate detection accuracy against labeled repositories
- Track improvements as detection signals evolve
- Identify edge cases and false positives/negatives
- Compare configurations (thresholds, LLM providers)
## Benchmark Dataset

### Location

```
benchmark/
├── repos.yaml    # Repository manifest with labels
├── cached/       # Cloned repos (gitignored)
├── results/      # Evaluation results
└── analysis/     # Analysis notebooks
```
### Dataset Format

```yaml
# benchmark/repos.yaml
schema_version: "1.0.0"
last_updated: "2025-01-15"
description: "Curated dataset for evaluating AI detection accuracy"

targets:
  accuracy: 0.90
  precision: 0.85
  recall: 0.95
  f1_score: 0.90

repositories:
  # Known AI repository
  - url: "https://github.com/huggingface/transformers"
    name: "transformers"
    labels:
      contains_ai: true
      ai_type: ["deep_learning", "nlp"]
      sector: "general_ml"
      risk_category: "minimal_risk"
      frameworks: ["pytorch", "tensorflow"]
      notes: "HuggingFace Transformers library"

  # Known non-AI repository
  - url: "https://github.com/pallets/flask"
    name: "flask"
    labels:
      contains_ai: false
      sector: "web_development"
      risk_category: null
      notes: "Python web framework - no AI components"
```

### Label Fields
| Field | Type | Description |
|---|---|---|
| `contains_ai` | boolean | Ground truth: does repo contain AI? |
| `ai_type` | string[] | Types of AI present |
| `sector` | string | Application domain |
| `risk_category` | string | EU AI Act risk category |
| `frameworks` | string[] | AI frameworks used |
| `notes` | string | Human notes |
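Because the manifest drives the evaluation scripts below, it is worth sanity-checking new entries against these fields. The loader below is a minimal sketch; which label fields are treated as mandatory, and the `load_dataset.py` name, are assumptions rather than part of the CLI.

```python
"""load_dataset.py - load and sanity-check benchmark/repos.yaml (illustrative sketch)."""
from pathlib import Path

import yaml

# Assumed mandatory for every entry; ai_type/frameworks only apply when contains_ai is true
REQUIRED_LABELS = {"contains_ai", "sector", "risk_category", "notes"}


def load_dataset(path: Path) -> list[dict]:
    """Return repository entries, warning about missing label fields."""
    with open(path) as f:
        dataset = yaml.safe_load(f)
    for repo in dataset["repositories"]:
        missing = REQUIRED_LABELS - repo["labels"].keys()
        if missing:
            print(f"Warning: {repo['name']} is missing labels: {sorted(missing)}")
    return dataset["repositories"]


if __name__ == "__main__":
    for repo in load_dataset(Path("benchmark/repos.yaml")):
        print(repo["name"], "AI" if repo["labels"]["contains_ai"] else "non-AI")
```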
## Running Benchmarks

### Basic Benchmark

```bash
# Run benchmark with default dataset
aigovhub benchmark

# Use custom dataset
aigovhub benchmark --dataset my-repos.yaml

# Output results to specific directory
aigovhub benchmark --output results/run-001/
```

### Manual Benchmark Process
Until the benchmark command is fully implemented, you can run benchmarks manually:
```bash
#!/bin/bash
# benchmark_run.sh

REPOS_FILE="benchmark/repos.yaml"
RESULTS_DIR="benchmark/results/$(date +%Y%m%d_%H%M%S)"

mkdir -p "$RESULTS_DIR"

# Parse "org/name" paths out of the YAML manifest and scan each
# (requires yq or a similar YAML parser)
for repo in $(yq '.repositories[].url' "$REPOS_FILE" | sed 's|https://github.com/||'); do
    name=$(basename "$repo")
    echo "Scanning $repo..."

    # Clone if not cached
    if [ ! -d "benchmark/cached/$name" ]; then
        git clone "https://github.com/$repo" "benchmark/cached/$name"
    fi

    # Scan
    aigovhub scan "benchmark/cached/$name" \
        --output "$RESULTS_DIR/$name.yaml" \
        --no-llm
done

# Evaluate results
python scripts/evaluate_benchmark.py "$RESULTS_DIR"
```

### Python Benchmark Script
"""benchmark_evaluate.py - Evaluate benchmark results."""
import yaml
from pathlib import Path
from dataclasses import dataclass
@dataclass
class BenchmarkResult:
true_positives: int = 0
true_negatives: int = 0
false_positives: int = 0
false_negatives: int = 0
@property
def accuracy(self) -> float:
total = self.tp + self.tn + self.fp + self.fn
return (self.tp + self.tn) / total if total else 0
@property
def precision(self) -> float:
return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0
@property
def recall(self) -> float:
return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0
@property
def f1_score(self) -> float:
p, r = self.precision, self.recall
return 2 * p * r / (p + r) if (p + r) else 0
def evaluate(repos_file: Path, results_dir: Path) -> BenchmarkResult:
"""Evaluate benchmark results against ground truth."""
result = BenchmarkResult()
# Load ground truth
with open(repos_file) as f:
dataset = yaml.safe_load(f)
for repo in dataset["repositories"]:
name = repo["name"]
expected_ai = repo["labels"]["contains_ai"]
# Load scan result
result_file = results_dir / f"{name}.yaml"
if not result_file.exists():
print(f"Warning: No result for {name}")
continue
with open(result_file) as f:
scan_result = yaml.safe_load(f)
detected_ai = len(scan_result.get("ai_systems", [])) > 0
# Update metrics
if expected_ai and detected_ai:
result.true_positives += 1
elif not expected_ai and not detected_ai:
result.true_negatives += 1
elif not expected_ai and detected_ai:
result.false_positives += 1
print(f"False positive: {name}")
else: # expected_ai and not detected_ai
result.false_negatives += 1
print(f"False negative: {name}")
return result
if __name__ == "__main__":
import sys
results_dir = Path(sys.argv[1])
result = evaluate(Path("benchmark/repos.yaml"), results_dir)
print(f"\nBenchmark Results:")
print(f" Accuracy: {result.accuracy:.1%}")
print(f" Precision: {result.precision:.1%}")
print(f" Recall: {result.recall:.1%}")
print(f" F1 Score: {result.f1_score:.1%}")Metrics
### Definitions
| Metric | Formula | Description |
|---|---|---|
| Accuracy | (TP + TN) / Total | Overall correctness |
| Precision | TP / (TP + FP) | How many detections are correct |
| Recall | TP / (TP + FN) | How many AI systems are found |
| F1 Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall |
### Confusion Matrix

```
                    Predicted
                    AI       Non-AI
Actual   AI         TP       FN
         Non-AI     FP       TN
```
Where:
- TP (True Positive): AI repo correctly detected
- TN (True Negative): Non-AI repo correctly ignored
- FP (False Positive): Non-AI repo incorrectly flagged
- FN (False Negative): AI repo missed
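For example, a run over 25 repositories that yields TP = 14, TN = 9, FP = 1, FN = 1 gives accuracy = (14 + 9) / 25 = 92%, precision = 14 / 15 ≈ 93%, recall = 14 / 15 ≈ 93%, and F1 ≈ 93% (illustrative numbers, not measured results).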
### Target Metrics
| Metric | Target | Rationale |
|---|---|---|
| Accuracy | >90% | General correctness |
| Precision | >85% | Minimize false alarms |
| Recall | >95% | Critical: Don't miss AI systems |
| F1 Score | >90% | Balanced performance |
Note: High recall is prioritized because missing an AI system (false negative) has greater compliance implications than a false positive.
## Curating the Dataset

### Repository Selection Criteria
- Diversity: Cover multiple AI types and sectors
- Clarity: Clear AI vs non-AI distinction
- Size: Meaningful codebases (not trivial demos)
- Activity: Recently maintained
- License: Allows analysis
### Finding AI Repositories

GitHub search queries:

```
# Deep learning
topic:deep-learning stars:>100

# NLP
topic:nlp topic:transformers stars:>50

# Computer vision
topic:computer-vision topic:pytorch stars:>50

# LLM integrations
topic:langchain OR topic:llm stars:>50

# Sector-specific
topic:healthcare topic:ml
topic:fintech topic:ai
```
### Finding Non-AI Repositories

```
# Web frameworks
topic:web-framework -topic:ai -topic:ml stars:>1000

# Developer tools
topic:cli -topic:ai stars:>500

# Libraries
topic:library -topic:machine-learning stars:>500
```
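The same qualifiers work against GitHub's repository search API, which is convenient when collecting many candidates at once. A minimal sketch (unauthenticated requests are heavily rate-limited; the `find_candidates.py` name and structure are illustrative):

```python
"""find_candidates.py - list candidate repositories via the GitHub search API (illustrative sketch)."""
import json
import urllib.parse
import urllib.request


def search_repos(query: str, per_page: int = 10) -> list[dict]:
    """Return repository results for a GitHub search query such as 'topic:nlp stars:>50'."""
    url = "https://api.github.com/search/repositories?" + urllib.parse.urlencode(
        {"q": query, "per_page": per_page}
    )
    request = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["items"]


if __name__ == "__main__":
    # Reuse the queries above to build a candidate list for manual labeling
    for item in search_repos("topic:deep-learning stars:>100"):
        print(item["full_name"], item["stargazers_count"])
```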
### Labeling Process
For each repository:
1. Clone and inspect the codebase
2. Check dependencies for AI libraries (see the sketch after this list)
3. Look for model files (.pt, .onnx, etc.)
4. Review documentation for AI mentions
5. Assign labels based on evidence
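Steps 2 and 3 can be partly automated. The helper below is a minimal sketch; the indicator lists and the `inspect_repo.py` name are assumptions, not part of the CLI:

```python
"""inspect_repo.py - gather quick evidence for manual labeling (illustrative sketch)."""
from pathlib import Path

# Assumed indicator lists; extend as needed
AI_LIBRARIES = {"torch", "tensorflow", "transformers", "scikit-learn", "langchain", "onnxruntime"}
MODEL_EXTENSIONS = {".pt", ".pth", ".onnx", ".h5", ".safetensors"}


def inspect_repo(repo_dir: Path) -> None:
    """Print AI-related dependencies and serialized model files found in a cloned repo."""
    # Step 2: check dependency manifests for AI libraries
    for manifest in repo_dir.rglob("requirements*.txt"):
        text = manifest.read_text(errors="ignore").lower()
        hits = sorted(lib for lib in AI_LIBRARIES if lib in text)
        if hits:
            print(f"{manifest.relative_to(repo_dir)}: {hits}")

    # Step 3: look for serialized model files
    for path in repo_dir.rglob("*"):
        if path.suffix in MODEL_EXTENSIONS:
            print(f"Model file: {path.relative_to(repo_dir)}")


if __name__ == "__main__":
    import sys

    inspect_repo(Path(sys.argv[1]))
```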
```yaml
# Template for new repository
- url: "https://github.com/org/repo"
  name: "repo"
  labels:
    contains_ai: true|false
    ai_type: []            # If contains_ai: true
    sector: ""
    risk_category: null    # or specific category
    frameworks: []
    notes: "Brief description"
```

## Analyzing Results
### Identifying Patterns
After running benchmarks, analyze failures:
```python
# Find common patterns in false negatives
# (assumes `false_negatives` holds the ground-truth entries for the repos that were missed)
for repo in false_negatives:
    print(f"Missed: {repo['name']}")
    print(f"  AI Types: {repo['labels']['ai_type']}")
    print(f"  Frameworks: {repo['labels']['frameworks']}")
```

Questions to ask:
- Are certain frameworks consistently missed?
- Are certain AI types harder to detect?
- Do false positives share common patterns?
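The first two questions can be answered by aggregating labels across the missed repositories. A minimal sketch, assuming `false_negatives` is the same list of ground-truth entries used above:

```python
from collections import Counter

# Count which frameworks and AI types appear most often among missed repositories
framework_counts = Counter(
    fw for repo in false_negatives for fw in repo["labels"].get("frameworks", [])
)
ai_type_counts = Counter(
    ai_type for repo in false_negatives for ai_type in repo["labels"].get("ai_type", [])
)

print("Most frequently missed frameworks:", framework_counts.most_common(5))
print("Most frequently missed AI types:", ai_type_counts.most_common(5))
```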
### Improving Detection
Based on analysis:
- Add libraries to `constants.py` if frameworks are missed (illustrated below)
- Add patterns to code detection if patterns are missed
- Adjust thresholds if confidence is miscalibrated
- Add test cases for identified edge cases
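As a purely illustrative example of the first item: adding a missed framework typically means extending a known-library list. The structure below is hypothetical; check the actual layout of `src/aigovhub/core/constants.py` before editing:

```python
# src/aigovhub/core/constants.py (hypothetical layout, for illustration only)
AI_LIBRARIES = {
    "torch",
    "tensorflow",
    "transformers",
    "jax",   # hypothetical addition after benchmark runs showed JAX repos being missed
    "flax",  # hypothetical addition
}
```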
### Tracking Progress
Store benchmark results over time:
```yaml
# benchmark/results/summary.yaml
runs:
  - date: "2025-01-15"
    version: "0.1.0"
    config:
      threshold: 0.7
      llm: false
    results:
      accuracy: 0.92
      precision: 0.88
      recall: 0.96
      f1: 0.92
    notes: "Added transformers detection"

  - date: "2025-01-20"
    version: "0.1.1"
    # ...
```
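To see whether a change actually helped, compare the latest run against the previous one. A minimal sketch, assuming each run in `summary.yaml` records a complete `results` block as in the layout above (the `compare_runs.py` name is illustrative):

```python
"""compare_runs.py - print metric deltas between the last two benchmark runs (illustrative sketch)."""
from pathlib import Path

import yaml

with open(Path("benchmark/results/summary.yaml")) as f:
    runs = yaml.safe_load(f)["runs"]

if len(runs) < 2:
    raise SystemExit("Need at least two runs to compare")

previous, latest = runs[-2], runs[-1]
for metric in ("accuracy", "precision", "recall", "f1"):
    old, new = previous["results"][metric], latest["results"][metric]
    print(f"{metric:>10}: {old:.2f} -> {new:.2f} ({new - old:+.2f})")
```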
## Best Practices

- Start small: Begin with 20-30 well-labeled repos
- Balance classes: Roughly equal AI and non-AI repos
- Cover sectors: Include high-risk domains
- Document decisions: Record why repos were labeled
- Version the dataset: Track changes to labels
- Regular updates: Re-evaluate as detection improves
- Human review: Spot-check results periodically
## Example Benchmark Run

```bash
# 1. Ensure dataset is up to date
head -20 benchmark/repos.yaml

# 2. Clone repositories (first time only)
./scripts/clone_benchmark_repos.sh

# 3. Run benchmark
aigovhub benchmark --output benchmark/results/run-001/

# 4. Review results
cat benchmark/results/run-001/summary.json

# 5. Analyze failures
python scripts/analyze_failures.py benchmark/results/run-001/

# 6. Update detection if needed
#    - Edit src/aigovhub/core/constants.py
#    - Add new signal detectors

# 7. Re-run and compare
aigovhub benchmark --output benchmark/results/run-002/
```