Vadym Samoilenko d7fd435210 Add IVU Testing & Performance Monitoring implementation plan

Standalone Python CLI tool plan for Barclays IVU Model compliance testing:
batch WebSocket analysis, AI-based scoring via Claude, consistency metrics,
per-run PDF reports, and comparative drift detection reports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-03 13:28:56 +00:00


IVU Testing & Performance Monitoring Script

Context

Barclays IVU (Independent Validation Unit) model compliance requires a structured methodology for testing and reporting on AI model performance. Per Section 7.1 of the requirements, the process ingests 20 predefined assets, runs each one 5 times, and compares the resulting reports across Quality, Consistency, Accuracy, Completeness, and Response Time. The suite runs every 6 months to detect model drift and to benchmark results against past responses.

This is a standalone Python CLI tool that connects to the running ModComms backend via WebSocket, runs batch analyses, scores the results, and generates comparative reports. It is NOT part of the site itself.

Directory Structure

testing/
├── ivu_runner.py              # Main CLI entry point
├── config.py                  # CLI args + env var configuration
├── models.py                  # Pydantic data models for IVU results
├── utils.py                   # CSV parsing, file I/O, MIME detection
├── ws_client.py               # Async WebSocket client
├── scoring/
│   ├── __init__.py
│   ├── ai_scorer.py           # Claude-based Quality/Accuracy/Completeness scorer (Anthropic API)
│   └── consistency_scorer.py  # Programmatic consistency comparison across runs
├── reporting/
│   ├── __init__.py
│   ├── pdf_report.py          # Per-run PDF report (reportlab)
│   ├── comparative_report.py  # Final HTML + PDF comparative summary
│   └── styles.py              # Shared brand colors and styling constants
├── assets/                    # Where the 20 test proof files go
│   └── .gitkeep
├── output/                    # Default output directory (gitignored)
│   └── .gitkeep
├── requirements.txt           # Standalone deps
└── sample_assets.csv          # Example CSV template

Implementation Steps

Step 1: Data Models (`testing/models.py`)

Define Pydantic models mirroring backend schemas from backend/app/models/schemas.py:

- SubReviewResult - mirrors SubReview (ragStatus, feedback, issues)
- AgentReviewResult - mirrors AgentReview (4 agent reviews + lead summary + overallStatus)
- SingleRunResult - one iteration result (agent_review, elapsed_seconds, iteration, error)
- AssetTestResult - all runs for one asset + computed scores
- IVUTestSuite - top-level model for the entire test run
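The model hierarchy above could be sketched roughly as follows, using Pydantic v2. Only the class names and the fields listed above come from the plan; the `reviews` mapping, `leadSummary`, `proof_name`, `consistency_score`, `started_at`, and `iterations` fields are hypothetical placeholders, and the real shapes should be taken from backend/app/models/schemas.py.

```python
from typing import Optional

from pydantic import BaseModel, Field


class SubReviewResult(BaseModel):
    """Mirrors SubReview: one agent's verdict on a proof."""
    ragStatus: str
    feedback: str
    issues: list[str] = Field(default_factory=list)


class AgentReviewResult(BaseModel):
    """Mirrors AgentReview. Field names here are assumptions."""
    # Hypothetical: the four agent reviews keyed by agent name.
    reviews: dict[str, SubReviewResult]
    leadSummary: str
    overallStatus: str


class SingleRunResult(BaseModel):
    """One iteration for one asset; `error` is set when the run failed."""
    iteration: int
    elapsed_seconds: float
    agent_review: Optional[AgentReviewResult] = None
    error: Optional[str] = None


class AssetTestResult(BaseModel):
    """All runs for one asset plus computed scores (hypothetical fields)."""
    proof_name: str
    runs: list[SingleRunResult] = Field(default_factory=list)
    consistency_score: Optional[float] = None


class IVUTestSuite(BaseModel):
    """Top-level container serialized to results.json."""
    started_at: str
    iterations: int
    assets: list[AssetTestResult] = Field(default_factory=list)
```

Serializing with `model_dump_json()` and reloading with `IVUTestSuite.model_validate_json(...)` would give a straightforward path to the results.json baseline used later for drift comparison.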

Step 2: Configuration (`testing/config.py`)

argparse-based CLI with settings:

- --csv (required) - path to CSV file
- --assets-dir (default: ./assets) - directory containing proof files
- --backend-url (default: ws://localhost:8000/ws/analyze) - WebSocket endpoint
- --iterations (default: 5) - runs per asset
- --output-dir (default: ./output/{timestamp})
- --anthropic-api-key (from env ANTHROPIC_API_KEY) - for Claude-based scoring
- --access-token (default: ivu-test-token, for use when the backend runs with DISABLE_AUTH=true)
- --delay-between-runs (default: 5 seconds)
- --baseline - path to previous results.json for drift comparison
- --skip-scoring / --skip-pdf flags
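A minimal argparse sketch of these settings is shown below. The flag names and defaults are taken from the list above; the function names, the `float` type for the delay, and the `run_` prefix on the timestamped output directory (borrowed from the Output Structure example) are assumptions.

```python
import argparse
import os
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="ivu_runner",
        description="IVU testing and performance monitoring",
    )
    parser.add_argument("--csv", required=True, help="path to CSV file")
    parser.add_argument("--assets-dir", default="./assets",
                        help="directory containing proof files")
    parser.add_argument("--backend-url", default="ws://localhost:8000/ws/analyze",
                        help="WebSocket endpoint")
    parser.add_argument("--iterations", type=int, default=5, help="runs per asset")
    parser.add_argument("--output-dir", default=None,
                        help="defaults to a timestamped directory under ./output")
    # Falls back to the environment so the key need not appear in shell history.
    parser.add_argument("--anthropic-api-key",
                        default=os.environ.get("ANTHROPIC_API_KEY"))
    parser.add_argument("--access-token", default="ivu-test-token")
    parser.add_argument("--delay-between-runs", type=float, default=5.0,
                        help="seconds to wait between runs")
    parser.add_argument("--baseline", default=None,
                        help="previous results.json for drift comparison")
    parser.add_argument("--skip-scoring", action="store_true")
    parser.add_argument("--skip-pdf", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.output_dir is None:
        # Assumed naming, following the Output Structure example.
        stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        args.output_dir = os.path.join("output", f"run_{stamp}")
    return args
```

Computing the output directory after parsing, rather than as a static default, keeps the timestamp tied to the moment the run starts.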

Step 3: CSV Parsing & Utilities (`testing/utils.py`)

CSV columns: campaign_name, brand_guidelines, client_lead, agency, agency_lead, proof_name, channel, sub_channel, proof_type, proof_file_name

Before any run starts, the parser validates that all required columns are present and that every referenced file exists in the assets directory.
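The up-front validation might look like the sketch below. The column names come from the list above; the function name `load_assets` and the choice of exception types are assumptions.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = [
    "campaign_name", "brand_guidelines", "client_lead", "agency", "agency_lead",
    "proof_name", "channel", "sub_channel", "proof_type", "proof_file_name",
]


def load_assets(csv_path, assets_dir):
    """Return the CSV rows as dicts, failing fast on missing columns or files."""
    assets_dir = Path(assets_dir)
    # utf-8-sig tolerates the byte-order mark that spreadsheet exports often add.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing_columns = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing_columns:
            raise ValueError(f"CSV is missing required columns: {missing_columns}")
        rows = list(reader)

    # Collect every missing file so the user can fix them all in one pass.
    missing_files = [
        row["proof_file_name"]
        for row in rows
        if not (assets_dir / row["proof_file_name"]).is_file()
    ]
    if missing_files:
        raise FileNotFoundError(f"Files not found in {assets_dir}: {missing_files}")
    return rows
```

Reporting all missing files at once, instead of stopping at the first, avoids repeated restarts of a batch that is expected to take a long time.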

Step 4: WebSocket Client (`testing/ws_client.py`)

Async client, built on the websockets library, that:

- Reads the file, base64-encodes it, and detects its MIME type
- Sends a {"type": "analyze", ...} message (intentionally omits campaign_id and proof_name so that test runs are not persisted to the production database; see handlers.py:211)
- Listens for agent_started, agent_completed, model_fallback, complete, and error messages
- Returns a structured result with timing
- Applies a 30s connection timeout and a 300s analysis timeout, with 3 retries and exponential backoff on connection failures
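The payload construction and the retry policy can be sketched without the network layer, as below. Only the `"type": "analyze"` field is confirmed by the plan; the `file_name`, `mime_type`, and `file_data` keys are hypothetical and should be replaced with whatever handlers.py actually expects. The `operation` argument stands in for the real connect-and-analyze coroutine.

```python
import asyncio
import base64
import mimetypes
from pathlib import Path


def build_analyze_message(path):
    """Build the analyze payload for one proof file."""
    path = Path(path)
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "type": "analyze",
        # Hypothetical field names; campaign_id and proof_name are
        # deliberately left out so the backend does not persist the run.
        "file_name": path.name,
        "mime_type": mime_type or "application/octet-stream",
        "file_data": base64.b64encode(path.read_bytes()).decode("ascii"),
    }


async def with_retries(operation, retries=3, base_delay=1.0):
    """Await operation(), retrying connection failures with exponential backoff.

    One initial attempt plus up to `retries` further attempts; the wait
    doubles each time (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    for attempt in range(retries + 1):
        try:
            return await operation()
        except (OSError, asyncio.TimeoutError):
            if attempt == retries:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```

The 30s and 300s limits could be applied by wrapping the connect and receive awaits in `asyncio.wait_for`, which raises a timeout error that the retry helper above would treat like any other connection failure.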

Step 5: AI-Based Scoring (`testing/scoring/ai_scorer.py`)

Uses Claude (Anthropic API) as an independent judge to evaluate each run's output. This is a stronger IVU approach than having Gemini score its own output, because a completely separate AI provider evaluates the analysis quality.

- Sends the proof image and the agent review text to Claude with a detailed rubric
- Uses the anthropic Python SDK with Claude Sonnet for cost-effective scoring
- Returns structured JSON scores (1-10) for:
  - Quality: Clarity, coherence, professional tone, no hallucinations, grammatical correctness
  - Accuracy: Information matches visible proof content, correct facts/numbers, appropriate RAG status
  - Completeness: All agent perspectives covered, legal/brand/channel aspects addressed, no omissions
- Requires the ANTHROPIC_API_KEY env var (separate from the Gemini key used by the backend)
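Because the judge returns free text that is expected to contain JSON, the scorer would benefit from a strict parsing step. The sketch below covers only that step and makes no API call; the lowercase key names `quality`, `accuracy`, and `completeness` are an assumed response schema, not something the plan specifies.

```python
import json

# Assumed JSON keys in the judge's reply.
METRICS = ("quality", "accuracy", "completeness")


def parse_scores(response_text):
    """Extract and validate the 1-10 scores from the judge's reply text."""
    # Tolerate prose around the JSON object by slicing from the first "{"
    # to the last "}".
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in scorer response")
    payload = json.loads(response_text[start:end + 1])

    scores = {}
    for metric in METRICS:
        value = payload.get(metric)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 10:
            raise ValueError(f"{metric} must be a number from 1 to 10, got {value!r}")
        scores[metric] = float(value)
    return scores
```

Rejecting out-of-range or missing values at this point keeps malformed judge output from silently skewing the averages in the comparative report.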

Step 6: Consistency Scoring (`testing/scoring/consistency_scorer.py`)

Programmatic comparison across iterations (no AI needed):

- RAG Status Consistency (25%): Unanimous RAG status across runs per agent
- Overall Status Consistency (25%): Same overallStatus across all runs
- Issue Overlap (25%): Fuzzy Jaccard similarity of issue lists using difflib.SequenceMatcher
- Feedback Similarity (25%): Normalized text comparison of feedback across runs
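One way the four equally weighted components could be combined is sketched below. The input shape (a list of dicts with `overallStatus` and an `agents` mapping), the 0.8 match threshold, the pairwise averaging, and the 0-1 output scale are all assumptions; the plan fixes only the four components, their weights, and the use of difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher
from itertools import combinations


def _ratio(a, b):
    """Similarity in [0, 1] after lowercasing and collapsing whitespace."""
    norm_a, norm_b = " ".join(a.lower().split()), " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def fuzzy_jaccard(left, right, threshold=0.8):
    """Jaccard similarity where two issues match if their ratio >= threshold."""
    if not left and not right:
        return 1.0
    unmatched = list(right)
    matched = 0
    for issue in left:
        best = max(unmatched, key=lambda other: _ratio(issue, other), default=None)
        if best is not None and _ratio(issue, best) >= threshold:
            unmatched.remove(best)  # each issue may be matched only once
            matched += 1
    return matched / (len(left) + len(right) - matched)


def _mean_pairwise(items, similarity):
    """Average similarity over every pair of runs (1.0 for a single run)."""
    pairs = list(combinations(items, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0


def consistency_score(runs):
    """Combine the four components with equal 25% weights; returns 0.0-1.0."""
    agents = list(runs[0]["agents"])
    # Fraction of agents whose RAG status is unanimous across runs.
    rag = sum(
        len({run["agents"][agent]["ragStatus"] for run in runs}) == 1
        for agent in agents
    ) / len(agents)
    overall = 1.0 if len({run["overallStatus"] for run in runs}) == 1 else 0.0
    issues = sum(
        _mean_pairwise([run["agents"][a]["issues"] for run in runs], fuzzy_jaccard)
        for a in agents
    ) / len(agents)
    feedback = sum(
        _mean_pairwise([run["agents"][a]["feedback"] for run in runs], _ratio)
        for a in agents
    ) / len(agents)
    return 0.25 * (rag + overall + issues + feedback)
```

Multiplying the result by 10 would put it on the same 1-10 scale as the AI-scored metrics, if the dashboard table needs a common scale.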

Step 7: Per-Run PDF Reports (`testing/reporting/pdf_report.py`)

Uses reportlab to generate PDFs matching the frontend's PDFReport.tsx layout:

- Cover page with Oliver brand, campaign info, iteration number
- Overall summary with RAG status and lead agent summary
- 4 agent review sections with RAG badges, feedback text, and key actions lists
- Brand colors: Navy #001f5a, Blue #00a3e0, RAG colors from PDFReport.tsx

Step 8: Comparative Report (`testing/reporting/comparative_report.py`)

Generates both HTML and PDF:

- Executive Summary: Run metadata, aggregate pass/fail rates, average scores
- Score Dashboard Table: One row per asset with avg Quality, Accuracy, Completeness, Consistency, Response Time
- Per-Asset Detail: Score breakdown by iteration, RAG distribution, response time stats
- Model Drift Analysis: If a baseline is provided, the delta per asset per metric, with regressions flagged when a score drops by more than 1 point
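The drift check reduces to a per-asset, per-metric subtraction against the baseline. In the sketch below, the mapping of asset name to average scores is an assumed view of results.json, and the function and key names are hypothetical; the 1-point regression threshold comes from the plan.

```python
METRICS = ("quality", "accuracy", "completeness", "consistency")


def detect_drift(baseline, current, threshold=1.0):
    """Compare average scores per asset and flag drops larger than `threshold`.

    Both arguments map asset name -> {metric: average score}. Assets that
    have no baseline entry are skipped, since no delta can be computed.
    """
    report = {}
    for asset, current_scores in current.items():
        baseline_scores = baseline.get(asset)
        if baseline_scores is None:
            continue
        deltas = {
            metric: current_scores[metric] - baseline_scores[metric]
            for metric in METRICS
            if metric in current_scores and metric in baseline_scores
        }
        # A regression is a drop of strictly more than `threshold` points.
        regressions = sorted(m for m, delta in deltas.items() if delta < -threshold)
        report[asset] = {"deltas": deltas, "regressions": regressions}
    return report
```

For example, a quality average that moves from 8.0 to 6.5 yields a delta of -1.5 and is flagged, while a move from 9.0 to 8.5 is reported but not flagged.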

Step 9: Main Runner (`testing/ivu_runner.py`)

Orchestrates the full flow:

- Parse CSV → validate files exist
- For each asset (20):
  - For each iteration (5):
    - Run WebSocket analysis → collect result + timing
    - Generate per-run PDF
    - Score with AI scorer (Quality, Accuracy, Completeness)
    - Delay between runs
  - Compute consistency score across all iterations
- Save raw results.json (for future baseline comparisons)
- Generate comparative report (HTML + PDF)
- Handle Ctrl+C gracefully (save partial results)
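The control flow above, including the partial save on Ctrl+C, might be structured as in the following sketch. It is synchronous and takes the analysis step as a plain callable (`analyze`, a hypothetical stand-in for the WebSocket client) so the loop and the interrupt handling stay visible; PDF generation and scoring are omitted, and the results layout is an assumption.

```python
import json
import time
from pathlib import Path


def run_suite(assets, iterations, analyze, output_dir, delay=0.0):
    """Run every asset `iterations` times and always write results.json."""
    results = {"completed": False, "assets": []}
    try:
        for asset in assets:
            entry = {"proof_name": asset, "runs": []}
            results["assets"].append(entry)
            for iteration in range(1, iterations + 1):
                started = time.monotonic()
                try:
                    review, error = analyze(asset), None
                except Exception as exc:
                    # A failed run is recorded, not fatal to the batch.
                    review, error = None, str(exc)
                entry["runs"].append({
                    "iteration": iteration,
                    "elapsed_seconds": time.monotonic() - started,
                    "agent_review": review,
                    "error": error,
                })
                if delay:
                    time.sleep(delay)
        results["completed"] = True
    except KeyboardInterrupt:
        # Ctrl+C: stop early and fall through to the save below.
        pass
    finally:
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "results.json").write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

`KeyboardInterrupt` does not derive from `Exception`, so the inner handler records ordinary run failures while an interrupt propagates to the outer handler and triggers the save.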

Output Structure

output/run_2026-03-03_14-30-00/
├── results.json                    # Machine-readable results (future baseline)
├── comparative_report.html         # Browser-viewable report
├── comparative_report.pdf          # PDF comparative report
├── per_run_pdfs/
│   ├── Asset01_ProofName_run1.pdf
│   ├── Asset01_ProofName_run2.pdf
│   └── ...                         # 20 assets × 5 runs = 100 PDFs
└── logs/
    └── ivu_runner.log              # Execution log
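The per-run PDF names in the tree above follow an `Asset{NN}_{ProofName}_run{N}.pdf` pattern. A small helper for building those paths might look like this; the function name and the rule of stripping non-alphanumeric characters from the proof name are assumptions.

```python
import re
from pathlib import Path


def per_run_pdf_path(output_dir, asset_index, proof_name, iteration):
    """Build the path for one per-run PDF inside per_run_pdfs/."""
    # Assumed sanitization: keep only letters and digits so the name is
    # safe on any filesystem.
    safe_name = re.sub(r"[^A-Za-z0-9]+", "", proof_name) or "Proof"
    file_name = f"Asset{asset_index:02d}_{safe_name}_run{iteration}.pdf"
    return Path(output_dir) / "per_run_pdfs" / file_name
```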

Dependencies (`testing/requirements.txt`)

websockets>=12.0
anthropic>=0.40.0
pydantic>=2.5.0
reportlab>=4.0

Key Files Referenced

backend/app/websocket/handlers.py - WebSocket protocol (lines 42-58 for file handling, 211 for persistence gate)
backend/app/models/schemas.py - Canonical SubReview/AgentReview schemas to mirror
frontend/components/PDFReport.tsx - PDF layout, brand colors, and structure to replicate
frontend/services/geminiService.ts - WebSocket client reference implementation
backend/app/services/gemini_service.py - Gemini API client pattern (backend reference only)

Verification

- Place test assets in testing/assets/, create a CSV with 2-3 assets
- Start backend with DISABLE_AUTH=true
- Run: cd testing && python ivu_runner.py --csv sample_assets.csv --iterations 2
- Verify per-run PDFs are generated in output directory
- Verify results.json contains all scores and timing data
- Verify comparative report HTML opens in browser with correct data
- Run again with --baseline output/{previous}/results.json to verify drift detection
