Standalone Python CLI tool plan for Barclays IVU Model compliance testing: batch WebSocket analysis, AI-based scoring via Claude, consistency metrics, per-run PDF reports, and comparative drift detection reports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IVU Testing & Performance Monitoring Script
Context
Barclays IVU (Independent Validation Unit) Model compliance requires a structured methodology for testing and reporting on AI model performance. Per Section 7.1 of the requirements, the process involves ingesting 20 predefined assets, running them 5 times each, and comparing reports across Quality, Consistency, Accuracy, Completeness, and Response Time. This runs every 6 months to detect model drift and benchmark past responses.
This is a standalone Python CLI tool that connects to the running ModComms backend via WebSocket, runs batch analyses, scores results, and generates comparative reports. It does NOT live inside the site itself.
Directory Structure
testing/
├── ivu_runner.py # Main CLI entry point
├── config.py # CLI args + env var configuration
├── models.py # Pydantic data models for IVU results
├── utils.py # CSV parsing, file I/O, MIME detection
├── ws_client.py # Async WebSocket client
├── scoring/
│ ├── __init__.py
│ ├── ai_scorer.py # Claude-based Quality/Accuracy/Completeness scorer (Anthropic API)
│ └── consistency_scorer.py # Programmatic consistency comparison across runs
├── reporting/
│ ├── __init__.py
│ ├── pdf_report.py # Per-run PDF report (reportlab)
│ ├── comparative_report.py # Final HTML + PDF comparative summary
│ └── styles.py # Shared brand colors and styling constants
├── assets/ # Where the 20 test proof files go
│ └── .gitkeep
├── output/ # Default output directory (gitignored)
│ └── .gitkeep
├── requirements.txt # Standalone deps
└── sample_assets.csv # Example CSV template
Implementation Steps
Step 1: Data Models (testing/models.py)
Define Pydantic models mirroring backend schemas from backend/app/models/schemas.py:
- SubReviewResult - mirrors SubReview (ragStatus, feedback, issues)
- AgentReviewResult - mirrors AgentReview (4 agent reviews + lead summary + overallStatus)
- SingleRunResult - one iteration result (agent_review, elapsed_seconds, iteration, error)
- AssetTestResult - all runs for one asset + computed scores
- IVUTestSuite - top-level model for the entire test run
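A minimal sketch of how these models might look in Pydantic v2. Only the class names and the fields named in the list above come from the plan; `agentReviews`, `leadSummary`, `proof_name`, `runs`, `scores`, and `assets` are hypothetical field names, and the real shapes should be taken from backend/app/models/schemas.py.

```python
from typing import Optional

from pydantic import BaseModel, Field


class SubReviewResult(BaseModel):
    ragStatus: str  # kept as a plain string; the backend may use an enum
    feedback: str
    issues: list[str] = Field(default_factory=list)


class AgentReviewResult(BaseModel):
    # Hypothetical layout: the 4 agent reviews keyed by agent name.
    agentReviews: dict[str, SubReviewResult]
    leadSummary: str
    overallStatus: str


class SingleRunResult(BaseModel):
    iteration: int
    elapsed_seconds: float
    agent_review: Optional[AgentReviewResult] = None  # None when the run failed
    error: Optional[str] = None


class AssetTestResult(BaseModel):
    proof_name: str  # hypothetical identifier for the asset
    runs: list[SingleRunResult] = Field(default_factory=list)
    scores: dict[str, float] = Field(default_factory=dict)


class IVUTestSuite(BaseModel):
    assets: list[AssetTestResult] = Field(default_factory=list)
```

Serializing the top-level model with `model_dump_json()` and reloading it with `model_validate_json()` would give a straightforward way to write and re-read results.json for later baseline comparisons.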
Step 2: Configuration (testing/config.py)
argparse-based CLI with settings:
- --csv (required) - path to CSV file
- --assets-dir (default: ./assets) - directory containing proof files
- --backend-url (default: ws://localhost:8000/ws/analyze) - WebSocket endpoint
- --iterations (default: 5) - runs per asset
- --output-dir (default: ./output/{timestamp})
- --anthropic-api-key (from env ANTHROPIC_API_KEY) - for Claude-based scoring
- --access-token (default: ivu-test-token for DISABLE_AUTH=true)
- --delay-between-runs (default: 5 seconds)
- --baseline - path to previous results.json for drift comparison
- --skip-scoring / --skip-pdf flags
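The settings above could be wired up with argparse roughly as follows. This is a sketch: the flag names and defaults come from the plan, while the function names `build_parser` and `parse_args` and the exact timestamp format are assumptions.

```python
import argparse
import os
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="IVU testing runner")
    parser.add_argument("--csv", required=True, help="path to CSV file")
    parser.add_argument("--assets-dir", default="./assets")
    parser.add_argument("--backend-url", default="ws://localhost:8000/ws/analyze")
    parser.add_argument("--iterations", type=int, default=5)
    parser.add_argument("--output-dir", default=None)
    # Falls back to the environment so the key never has to appear in shell history.
    parser.add_argument("--anthropic-api-key", default=os.environ.get("ANTHROPIC_API_KEY"))
    parser.add_argument("--access-token", default="ivu-test-token")
    parser.add_argument("--delay-between-runs", type=float, default=5.0)
    parser.add_argument("--baseline", default=None, help="previous results.json")
    parser.add_argument("--skip-scoring", action="store_true")
    parser.add_argument("--skip-pdf", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.output_dir is None:
        # Assumed timestamp format; the plan only specifies ./output/{timestamp}.
        stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        args.output_dir = f"./output/{stamp}"
    return args
```

Resolving the timestamped output directory after parsing, instead of in the argparse default, keeps the value tied to the moment the run starts.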
Step 3: CSV Parsing & Utilities (testing/utils.py)
CSV columns: campaign_name, brand_guidelines, client_lead, agency, agency_lead, proof_name, channel, sub_channel, proof_type, proof_file_name
Validates all required columns exist and all referenced files exist in assets directory before starting.
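A sketch of that up-front validation using only the standard library. The column names are taken from the list above; the function name `load_assets` and the choice of exception types are assumptions.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = [
    "campaign_name", "brand_guidelines", "client_lead", "agency", "agency_lead",
    "proof_name", "channel", "sub_channel", "proof_type", "proof_file_name",
]


def load_assets(csv_path: str, assets_dir: str) -> list[dict]:
    """Read the asset CSV and fail fast if columns or proof files are missing."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing_columns = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing_columns:
            raise ValueError(f"CSV is missing columns: {missing_columns}")
        rows = list(reader)

    # Check every referenced file before any analysis starts.
    missing_files = [
        row["proof_file_name"]
        for row in rows
        if not (Path(assets_dir) / row["proof_file_name"]).is_file()
    ]
    if missing_files:
        raise FileNotFoundError(f"Files not found in {assets_dir}: {missing_files}")
    return rows
```

Collecting all missing files before raising means a single failed start reports every problem at once.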
Step 4: WebSocket Client (testing/ws_client.py)
Async client using websockets library that:
- Reads file, base64-encodes it, detects MIME type
- Sends {"type": "analyze", ...} message (intentionally omits campaign_id and proof_name to avoid persisting test runs to the production database - see handlers.py:211)
- Listens for agent_started, agent_completed, model_fallback, complete, error messages
- Returns structured result with timing
- 30s connection timeout, 300s analysis timeout, 3 retries with exponential backoff on connection failures
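Two pieces of the client can be sketched without the websockets library: building the analyze payload and retrying a connection with exponential backoff. Apart from "type": "analyze", the payload field names (`file_name`, `mime_type`, `file_data`) are hypothetical and should be checked against handlers.py. The sketch also reads "3 retries" as three further attempts after the first failure, and `connect` stands in for whatever coroutine opens the socket.

```python
import asyncio
import base64
import mimetypes
from pathlib import Path


def build_analyze_message(file_path: str) -> dict:
    """Build the analyze payload; campaign_id and proof_name are deliberately absent."""
    path = Path(file_path)
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "type": "analyze",
        "file_name": path.name,  # hypothetical field name
        "mime_type": mime_type or "application/octet-stream",  # hypothetical field name
        "file_data": base64.b64encode(path.read_bytes()).decode("ascii"),  # hypothetical
    }


async def connect_with_retries(connect, retries: int = 3,
                               base_delay: float = 1.0, timeout: float = 30.0):
    """Await connect(), retrying on connection failures with exponential backoff."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(connect(), timeout=timeout)
        except (OSError, asyncio.TimeoutError) as exc:
            last_error = exc
            if attempt < retries:
                # Waits base_delay, then 2x, then 4x between attempts.
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise ConnectionError(f"gave up after {retries + 1} attempts") from last_error
```

In the real client, `connect` would wrap the websockets connection call, and the 300s analysis timeout would apply separately to the receive loop.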
Step 5: AI-Based Scoring (testing/scoring/ai_scorer.py)
Uses Claude (Anthropic API) as an independent judge to evaluate each run's output. This is a stronger IVU approach than using Gemini to score itself - a completely separate AI provider evaluates the analysis quality.
- Sends the proof image + agent review text to Claude with a detailed rubric
- Uses anthropic Python SDK with Claude Sonnet for cost-effective scoring
- Returns structured JSON scores (1-10) for:
  - Quality: Clarity, coherence, professional tone, no hallucinations, grammatical correctness
  - Accuracy: Information matches visible proof content, correct facts/numbers, appropriate RAG status
  - Completeness: All agent perspectives covered, legal/brand/channel aspects addressed, no omissions
- Requires ANTHROPIC_API_KEY env var (separate from the Gemini key used by the backend)
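Whatever the judge model returns, the scorer has to turn it into validated numbers before they reach a report. The sketch below covers only that parsing step and makes no API call; the lowercase JSON keys and the `parse_scores` name are assumptions about a response format the plan does not spell out.

```python
import json

SCORE_KEYS = ("quality", "accuracy", "completeness")  # assumed JSON keys


def parse_scores(raw_text: str) -> dict:
    """Extract the JSON object from the judge's reply and validate the 1-10 scores."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in scorer response")
    data = json.loads(raw_text[start:end + 1])

    scores = {}
    for key in SCORE_KEYS:
        value = data.get(key)
        # bool is a subclass of int, so reject it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 10:
            raise ValueError(f"invalid or missing score for {key!r}: {value!r}")
        scores[key] = float(value)
    return scores
```

Slicing from the first `{` to the last `}` tolerates a reply that wraps the JSON in a sentence of prose; a reply with no valid object raises an error, so the run can be marked as unscored instead of silently recorded as zero.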
Step 6: Consistency Scoring (testing/scoring/consistency_scorer.py)
Programmatic comparison across iterations (no AI needed):
- RAG Status Consistency (25%): Unanimous RAG status across runs per agent
- Overall Status Consistency (25%): Same overallStatus across all runs
- Issue Overlap (25%): Fuzzy Jaccard similarity of issue lists using difflib.SequenceMatcher
- Feedback Similarity (25%): Normalized text comparison of feedback across runs
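One possible reading of the four equally weighted components, sketched with the standard library. The 0.8 match threshold, the pairwise averaging across runs, and the simplified run dictionaries (`rag`, `overall`, `issues`, `feedback`) are all assumptions; the plan fixes only the weights and the use of difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher
from itertools import combinations


def text_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] after lowercasing and trimming whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_jaccard(issues_a: list, issues_b: list, threshold: float = 0.8) -> float:
    """Jaccard similarity where two issues match if their text ratio >= threshold."""
    if not issues_a and not issues_b:
        return 1.0
    unmatched = list(issues_b)
    matches = 0
    for issue in issues_a:
        best = max(unmatched, key=lambda other: text_ratio(issue, other), default=None)
        if best is not None and text_ratio(issue, best) >= threshold:
            matches += 1
            unmatched.remove(best)  # each issue can be matched at most once
    return matches / (len(issues_a) + len(issues_b) - matches)


def mean_pairwise(items: list, similarity) -> float:
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def consistency_score(runs: list) -> float:
    """runs: non-empty list of dicts with rag (agent -> status), overall, issues, feedback."""
    agents = list(runs[0]["rag"])
    rag = sum(len({run["rag"][agent] for run in runs}) == 1 for agent in agents) / len(agents)
    overall = 1.0 if len({run["overall"] for run in runs}) == 1 else 0.0
    issues = mean_pairwise([run["issues"] for run in runs], fuzzy_jaccard)
    feedback = mean_pairwise([run["feedback"] for run in runs], text_ratio)
    return 0.25 * (rag + overall + issues + feedback)
```

The result lies between 0 and 1; scaling it to the same 1-10 range as the AI scores would be a separate design decision.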
Step 7: Per-Run PDF Reports (testing/reporting/pdf_report.py)
Uses reportlab to generate PDFs matching the frontend's PDFReport.tsx layout:
- Cover page with Oliver brand, campaign info, iteration number
- Overall summary with RAG status and lead agent summary
- 4 agent review sections with RAG badges, feedback text, and key actions lists
- Brand colors: Navy #001f5a, Blue #00a3e0, RAG colors from PDFReport.tsx
Step 8: Comparative Report (testing/reporting/comparative_report.py)
Generates both HTML and PDF:
- Executive Summary: Run metadata, aggregate pass/fail rates, average scores
- Score Dashboard Table: One row per asset with avg Quality, Accuracy, Completeness, Consistency, Response Time
- Per-Asset Detail: Score breakdown by iteration, RAG distribution, response time stats
- Model Drift Analysis: If baseline provided, delta per asset per metric, flagged regressions (>1 point drop)
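The drift rule in the last bullet (delta per asset per metric, regression when a score drops by more than 1 point) can be sketched as below. The input shape, a mapping of asset name to average score per metric, and the `detect_drift` name are assumptions about how results.json would be summarized.

```python
def detect_drift(baseline: dict, current: dict, threshold: float = 1.0) -> list:
    """Compare average scores per asset and metric; flag drops larger than threshold."""
    report = []
    for asset, metrics in current.items():
        if asset not in baseline:
            continue  # new asset, nothing to compare against
        for metric, value in metrics.items():
            if metric not in baseline[asset]:
                continue
            delta = value - baseline[asset][metric]
            report.append({
                "asset": asset,
                "metric": metric,
                "delta": delta,
                "regression": delta < -threshold,  # strictly more than a 1 point drop
            })
    return report
```

For example, a Quality average that moves from 8.0 to 6.5 yields a delta of -1.5 and is flagged, while a move from 8.0 to 7.0 is not. Response Time is measured in seconds, not on the 1-10 scale, so it would likely need its own threshold and an inverted sign.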
Step 9: Main Runner (testing/ivu_runner.py)
Orchestrates the full flow:
- Parse CSV → validate files exist
- For each asset (20):
  - For each iteration (5):
    - Run WebSocket analysis → collect result + timing
    - Generate per-run PDF
    - Score with AI scorer (Quality, Accuracy, Completeness)
    - Delay between runs
  - Compute consistency score across all iterations
- Save raw results.json (for future baseline comparisons)
- Generate comparative report (HTML + PDF)
- Handle Ctrl+C gracefully (save partial results)
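The core of the flow above, the nested asset/iteration loop with partial results saved on interruption, might be sketched like this. `analyze` is a hypothetical coroutine standing in for the WebSocket client, and PDF generation, scoring, and reporting are left out to keep the sketch short.

```python
import asyncio
import json
from pathlib import Path


async def run_suite(assets: list, iterations: int, analyze,
                    output_dir: str, delay: float = 0.0) -> dict:
    """Run every asset `iterations` times; always write whatever was collected."""
    results = {}
    try:
        for asset in assets:
            runs = results.setdefault(asset, [])
            for iteration in range(1, iterations + 1):
                runs.append(await analyze(asset, iteration))
                await asyncio.sleep(delay)  # delay between runs
    finally:
        # Runs on normal completion, on errors, and when Ctrl+C cancels the task,
        # so a partial results.json is still available.
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "results.json").write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

Using `try`/`finally` instead of catching KeyboardInterrupt directly avoids depending on how a given Python version delivers Ctrl+C to a running event loop; the caller of `asyncio.run` can still catch the interrupt and exit cleanly.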
Output Structure
output/run_2026-03-03_14-30-00/
├── results.json # Machine-readable results (future baseline)
├── comparative_report.html # Browser-viewable report
├── comparative_report.pdf # PDF comparative report
├── per_run_pdfs/
│ ├── Asset01_ProofName_run1.pdf
│ ├── Asset01_ProofName_run2.pdf
│ └── ... # 20 assets × 5 runs = 100 PDFs
└── logs/
└── ivu_runner.log # Execution log
Dependencies (testing/requirements.txt)
websockets>=12.0
anthropic>=0.40.0
pydantic>=2.5.0
reportlab>=4.0
Key Files Referenced
- backend/app/websocket/handlers.py - WebSocket protocol (lines 42-58 for file handling, 211 for persistence gate)
- backend/app/models/schemas.py - Canonical SubReview/AgentReview schemas to mirror
- frontend/components/PDFReport.tsx - PDF layout, brand colors, and structure to replicate
- frontend/services/geminiService.ts - WebSocket client reference implementation
- backend/app/services/gemini_service.py - Gemini API client pattern (backend reference only)
Verification
- Place test assets in testing/assets/, create a CSV with 2-3 assets
- Start backend with DISABLE_AUTH=true
- Run: cd testing && python ivu_runner.py --csv sample_assets.csv --iterations 2
- Verify per-run PDFs are generated in output directory
- Verify results.json contains all scores and timing data
- Verify comparative report HTML opens in browser with correct data
- Run again with --baseline output/{previous}/results.json to verify drift detection