# Benchmark Guide
This guide explains how to use the AIGovHub benchmark dataset to evaluate detection accuracy.
## Overview
The benchmark dataset allows you to:
- Evaluate detection accuracy against labeled repositories
- Track improvements as detection signals evolve
- Identify edge cases and false positives/negatives
- Compare configurations (thresholds, LLM providers)
## Benchmark Dataset

### Location

```
benchmark/
├── repos.yaml    # Repository manifest with labels
├── cached/       # Cloned repos (gitignored)
├── results/      # Evaluation results
└── analysis/     # Analysis notebooks
```
### Dataset Format

```yaml
# benchmark/repos.yaml
schema_version: "1.0.0"
last_updated: "2025-01-15"
description: "Curated dataset for evaluating AI detection accuracy"

targets:
  accuracy: 0.90
  precision: 0.85
  recall: 0.95
  f1_score: 0.90

repositories:
  # Known AI repository
  - url: "https://github.com/huggingface/transformers"
    name: "transformers"
    labels:
      contains_ai: true
      ai_type: ["deep_learning", "nlp"]
      sector: "general_ml"
      risk_category: "minimal_risk"
      frameworks: ["pytorch", "tensorflow"]
      notes: "HuggingFace Transformers library"

  # Known non-AI repository
  - url: "https://github.com/pallets/flask"
    name: "flask"
    labels:
      contains_ai: false
      sector: "web_development"
      risk_category: null
      notes: "Python web framework - no AI components"
```

### Label Fields
| Field | Type | Description |
|---|---|---|
| `contains_ai` | boolean | Ground truth: does repo contain AI? |
| `ai_type` | string[] | Types of AI present |
| `sector` | string | Application domain |
| `risk_category` | string | EU AI Act risk category |
| `frameworks` | string[] | AI frameworks used |
| `notes` | string | Human notes |
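Because the manifest drives the evaluation scripts below, it is worth sanity-checking new entries against these fields. The loader below is a minimal sketch; which label fields are treated as mandatory, and the `load_dataset.py` name, are assumptions rather than part of the CLI.

```python
"""load_dataset.py - load and sanity-check benchmark/repos.yaml (illustrative sketch)."""
from pathlib import Path

import yaml

# Assumed mandatory for every entry; ai_type/frameworks only apply when contains_ai is true
REQUIRED_LABELS = {"contains_ai", "sector", "risk_category", "notes"}


def load_dataset(path: Path) -> list[dict]:
    """Return repository entries, warning about missing label fields."""
    with open(path) as f:
        dataset = yaml.safe_load(f)
    for repo in dataset["repositories"]:
        missing = REQUIRED_LABELS - repo["labels"].keys()
        if missing:
            print(f"Warning: {repo['name']} is missing labels: {sorted(missing)}")
    return dataset["repositories"]


if __name__ == "__main__":
    for repo in load_dataset(Path("benchmark/repos.yaml")):
        print(repo["name"], "AI" if repo["labels"]["contains_ai"] else "non-AI")
```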
## Running Benchmarks

### Basic Benchmark

```bash
# Run benchmark with default dataset
aigovhub benchmark

# Use custom dataset
aigovhub benchmark --dataset my-repos.yaml

# Output results to specific directory
aigovhub benchmark --output results/run-001/
```

### Manual Benchmark Process
Until the benchmark command is fully implemented, you can run benchmarks manually:
```bash
#!/bin/bash
# benchmark_run.sh

REPOS_FILE="benchmark/repos.yaml"
RESULTS_DIR="benchmark/results/$(date +%Y%m%d_%H%M%S)"

mkdir -p "$RESULTS_DIR"

# Parse "org/name" paths out of the YAML manifest and scan each
# (requires yq or a similar YAML parser)
for repo in $(yq '.repositories[].url' "$REPOS_FILE" | sed 's|https://github.com/||'); do
    name=$(basename "$repo")
    echo "Scanning $repo..."

    # Clone if not cached
    if [ ! -d "benchmark/cached/$name" ]; then
        git clone "https://github.com/$repo" "benchmark/cached/$name"
    fi

    # Scan
    aigovhub scan "benchmark/cached/$name" \
        --output "$RESULTS_DIR/$name.yaml" \
        --no-llm
done

# Evaluate results
python scripts/evaluate_benchmark.py "$RESULTS_DIR"
```

### Python Benchmark Script
"""benchmark_evaluate.py - Evaluate benchmark results."""
import yaml
from pathlib import Path
from dataclasses import dataclass
@dataclass
class BenchmarkResult:
true_positives: int = 0
true_negatives: int = 0
false_positives: int = 0
false_negatives: int = 0
@property
def accuracy(self) -> float:
total = self.tp + self.tn + self.fp + self.fn
return (self.tp + self.tn) / total if total else 0
@property
def precision(self) -> float:
return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0
@property
def recall(self) -> float:
return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0
@property
def f1_score(self) -> float:
p, r = self.precision, self.recall
return 2 * p * r / (p + r) if (p + r) else 0
def evaluate(repos_file: Path, results_dir: Path) -> BenchmarkResult:
"""Evaluate benchmark results against ground truth."""
result = BenchmarkResult()
# Load ground truth
with open(repos_file) as f:
dataset = yaml.safe_load(f)
for repo in dataset["repositories"]:
name = repo["name"]
expected_ai = repo["labels"]["contains_ai"]
# Load scan result
result_file = results_dir / f"{name}.yaml"
if not result_file.exists():
print(f"Warning: No result for {name}")
continue
with open(result_file) as f:
scan_result = yaml.safe_load(f)
detected_ai = len(scan_result.get("ai_systems", [])) > 0
# Update metrics
if expected_ai and detected_ai:
result.true_positives += 1
elif not expected_ai and not detected_ai:
result.true_negatives += 1
elif not expected_ai and detected_ai:
result.false_positives += 1
print(f"False positive: {name}")
else: # expected_ai and not detected_ai
result.false_negatives += 1
print(f"False negative: {name}")
return result
if __name__ == "__main__":
import sys
results_dir = Path(sys.argv[1])
result = evaluate(Path("benchmark/repos.yaml"), results_dir)
print(f"\nBenchmark Results:")
print(f" Accuracy: {result.accuracy:.1%}")
print(f" Precision: {result.precision:.1%}")
print(f" Recall: {result.recall:.1%}")
print(f" F1 Score: {result.f1_score:.1%}")Metrics
### Definitions
| Metric | Formula | Description |
|---|---|---|
| Accuracy | (TP + TN) / Total | Overall correctness |
| Precision | TP / (TP + FP) | How many detections are correct |
| Recall | TP / (TP + FN) | How many AI systems are found |
| F1 Score | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall |
### Confusion Matrix

```
                    Predicted
                    AI       Non-AI
Actual   AI         TP       FN
         Non-AI     FP       TN
```
Where:
- TP (True Positive): AI repo correctly detected
- TN (True Negative): Non-AI repo correctly ignored
- FP (False Positive): Non-AI repo incorrectly flagged
- FN (False Negative): AI repo missed
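For example, a run over 25 repositories that yields TP = 14, TN = 9, FP = 1, FN = 1 gives accuracy = (14 + 9) / 25 = 92%, precision = 14 / 15 ≈ 93%, recall = 14 / 15 ≈ 93%, and F1 ≈ 93% (illustrative numbers, not measured results).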
### Target Metrics
| Metric | Target | Rationale |
|---|---|---|
| Accuracy | >90% | General correctness |
| Precision | >85% | Minimize false alarms |
| Recall | >95% | Critical: Don't miss AI systems |
| F1 Score | >90% | Balanced performance |
Note: High recall is prioritized because missing an AI system (false negative) has greater compliance implications than a false positive.
## Curating the Dataset

### Repository Selection Criteria
- Diversity: Cover multiple AI types and sectors
- Clarity: Clear AI vs non-AI distinction
- Size: Meaningful codebases (not trivial demos)
- Activity: Recently maintained
- License: Allows analysis
### Finding AI Repositories

GitHub search queries:

```
# Deep learning
topic:deep-learning stars:>100

# NLP
topic:nlp topic:transformers stars:>50

# Computer vision
topic:computer-vision topic:pytorch stars:>50

# LLM integrations
topic:langchain OR topic:llm stars:>50

# Sector-specific
topic:healthcare topic:ml
topic:fintech topic:ai
```
### Finding Non-AI Repositories

```
# Web frameworks
topic:web-framework -topic:ai -topic:ml stars:>1000

# Developer tools
topic:cli -topic:ai stars:>500

# Libraries
topic:library -topic:machine-learning stars:>500
```
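The same qualifiers work against GitHub's repository search API, which is convenient when collecting many candidates at once. A minimal sketch (unauthenticated requests are heavily rate-limited; the `find_candidates.py` name and structure are illustrative):

```python
"""find_candidates.py - list candidate repositories via the GitHub search API (illustrative sketch)."""
import json
import urllib.parse
import urllib.request


def search_repos(query: str, per_page: int = 10) -> list[dict]:
    """Return repository results for a GitHub search query such as 'topic:nlp stars:>50'."""
    url = "https://api.github.com/search/repositories?" + urllib.parse.urlencode(
        {"q": query, "per_page": per_page}
    )
    request = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["items"]


if __name__ == "__main__":
    # Reuse the queries above to build a candidate list for manual labeling
    for item in search_repos("topic:deep-learning stars:>100"):
        print(item["full_name"], item["stargazers_count"])
```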
### Labeling Process
For each repository:
1. Clone and inspect the codebase
2. Check dependencies for AI libraries (see the sketch after this list)
3. Look for model files (.pt, .onnx, etc.)
4. Review documentation for AI mentions
5. Assign labels based on evidence
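Steps 2 and 3 can be partly automated. The helper below is a minimal sketch; the indicator lists and the `inspect_repo.py` name are assumptions, not part of the CLI:

```python
"""inspect_repo.py - gather quick evidence for manual labeling (illustrative sketch)."""
from pathlib import Path

# Assumed indicator lists; extend as needed
AI_LIBRARIES = {"torch", "tensorflow", "transformers", "scikit-learn", "langchain", "onnxruntime"}
MODEL_EXTENSIONS = {".pt", ".pth", ".onnx", ".h5", ".safetensors"}


def inspect_repo(repo_dir: Path) -> None:
    """Print AI-related dependencies and serialized model files found in a cloned repo."""
    # Step 2: check dependency manifests for AI libraries
    for manifest in repo_dir.rglob("requirements*.txt"):
        text = manifest.read_text(errors="ignore").lower()
        hits = sorted(lib for lib in AI_LIBRARIES if lib in text)
        if hits:
            print(f"{manifest.relative_to(repo_dir)}: {hits}")

    # Step 3: look for serialized model files
    for path in repo_dir.rglob("*"):
        if path.suffix in MODEL_EXTENSIONS:
            print(f"Model file: {path.relative_to(repo_dir)}")


if __name__ == "__main__":
    import sys

    inspect_repo(Path(sys.argv[1]))
```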
```yaml
# Template for new repository
- url: "https://github.com/org/repo"
  name: "repo"
  labels:
    contains_ai: true|false
    ai_type: []            # If contains_ai: true
    sector: ""
    risk_category: null    # or specific category
    frameworks: []
    notes: "Brief description"
```

## Analyzing Results
### Identifying Patterns
After running benchmarks, analyze failures:
```python
# Find common patterns in false negatives
# (assumes `false_negatives` holds the ground-truth entries for the repos that were missed)
for repo in false_negatives:
    print(f"Missed: {repo['name']}")
    print(f"  AI Types: {repo['labels']['ai_type']}")
    print(f"  Frameworks: {repo['labels']['frameworks']}")
```

Questions to ask:
- Are certain frameworks consistently missed?
- Are certain AI types harder to detect?
- Do false positives share common patterns?
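The first two questions can be answered by aggregating labels across the missed repositories. A minimal sketch, assuming `false_negatives` is the same list of ground-truth entries used above:

```python
from collections import Counter

# Count which frameworks and AI types appear most often among missed repositories
framework_counts = Counter(
    fw for repo in false_negatives for fw in repo["labels"].get("frameworks", [])
)
ai_type_counts = Counter(
    ai_type for repo in false_negatives for ai_type in repo["labels"].get("ai_type", [])
)

print("Most frequently missed frameworks:", framework_counts.most_common(5))
print("Most frequently missed AI types:", ai_type_counts.most_common(5))
```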
### Improving Detection
Based on analysis:
- Add libraries to `constants.py` if frameworks are missed (illustrated below)
- Add patterns to code detection if patterns are missed
- Adjust thresholds if confidence is miscalibrated
- Add test cases for identified edge cases
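As a purely illustrative example of the first item: adding a missed framework typically means extending a known-library list. The structure below is hypothetical; check the actual layout of `src/aigovhub/core/constants.py` before editing:

```python
# src/aigovhub/core/constants.py (hypothetical layout, for illustration only)
AI_LIBRARIES = {
    "torch",
    "tensorflow",
    "transformers",
    "jax",   # hypothetical addition after benchmark runs showed JAX repos being missed
    "flax",  # hypothetical addition
}
```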
### Tracking Progress
Store benchmark results over time:
```yaml
# benchmark/results/summary.yaml
runs:
  - date: "2025-01-15"
    version: "0.1.0"
    config:
      threshold: 0.7
      llm: false
    results:
      accuracy: 0.92
      precision: 0.88
      recall: 0.96
      f1: 0.92
    notes: "Added transformers detection"

  - date: "2025-01-20"
    version: "0.1.1"
    # ...
```
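To see whether a change actually helped, compare the latest run against the previous one. A minimal sketch, assuming each run in `summary.yaml` records a complete `results` block as in the layout above (the `compare_runs.py` name is illustrative):

```python
"""compare_runs.py - print metric deltas between the last two benchmark runs (illustrative sketch)."""
from pathlib import Path

import yaml

with open(Path("benchmark/results/summary.yaml")) as f:
    runs = yaml.safe_load(f)["runs"]

if len(runs) < 2:
    raise SystemExit("Need at least two runs to compare")

previous, latest = runs[-2], runs[-1]
for metric in ("accuracy", "precision", "recall", "f1"):
    old, new = previous["results"][metric], latest["results"][metric]
    print(f"{metric:>10}: {old:.2f} -> {new:.2f} ({new - old:+.2f})")
```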
## Best Practices

- Start small: Begin with 20-30 well-labeled repos
- Balance classes: Roughly equal AI and non-AI repos
- Cover sectors: Include high-risk domains
- Document decisions: Record why repos were labeled
- Version the dataset: Track changes to labels
- Regular updates: Re-evaluate as detection improves
- Human review: Spot-check results periodically
## Example Benchmark Run

```bash
# 1. Ensure dataset is up to date
head -20 benchmark/repos.yaml

# 2. Clone repositories (first time only)
./scripts/clone_benchmark_repos.sh

# 3. Run benchmark
aigovhub benchmark --output benchmark/results/run-001/

# 4. Review results
cat benchmark/results/run-001/summary.json

# 5. Analyze failures
python scripts/analyze_failures.py benchmark/results/run-001/

# 6. Update detection if needed
#    - Edit src/aigovhub/core/constants.py
#    - Add new signal detectors

# 7. Re-run and compare
aigovhub benchmark --output benchmark/results/run-002/
```