modcomms/IVU_TESTING_PLAN.md
Vadym Samoilenko d7fd435210 Add IVU Testing & Performance Monitoring implementation plan
Standalone Python CLI tool plan for Barclays IVU Model compliance testing:
batch WebSocket analysis, AI-based scoring via Claude, consistency metrics,
per-run PDF reports, and comparative drift detection reports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:28:56 +00:00

IVU Testing & Performance Monitoring Script

Context

Barclays IVU (Independent Validation Unit) Model compliance requires a structured methodology for testing and reporting on AI model performance. Per Section 7.1 of the requirements, the process involves ingesting 20 predefined assets, running each of them 5 times, and comparing the resulting reports across Quality, Consistency, Accuracy, Completeness, and Response Time. The test cycle repeats every 6 months to detect model drift and to benchmark against past responses.

This is a standalone Python CLI tool that connects to the running ModComms backend via WebSocket, runs batch analyses, scores results, and generates comparative reports. It does NOT live inside the site itself.

Directory Structure

testing/
├── ivu_runner.py              # Main CLI entry point
├── config.py                  # CLI args + env var configuration
├── models.py                  # Pydantic data models for IVU results
├── utils.py                   # CSV parsing, file I/O, MIME detection
├── ws_client.py               # Async WebSocket client
├── scoring/
│   ├── __init__.py
│   ├── ai_scorer.py           # Claude-based Quality/Accuracy/Completeness scorer (Anthropic API)
│   └── consistency_scorer.py  # Programmatic consistency comparison across runs
├── reporting/
│   ├── __init__.py
│   ├── pdf_report.py          # Per-run PDF report (reportlab)
│   ├── comparative_report.py  # Final HTML + PDF comparative summary
│   └── styles.py              # Shared brand colors and styling constants
├── assets/                    # Where the 20 test proof files go
│   └── .gitkeep
├── output/                    # Default output directory (gitignored)
│   └── .gitkeep
├── requirements.txt           # Standalone deps
└── sample_assets.csv          # Example CSV template

Implementation Steps

Step 1: Data Models (testing/models.py)

Define Pydantic models mirroring backend schemas from backend/app/models/schemas.py:

  • SubReviewResult - mirrors SubReview (ragStatus, feedback, issues)
  • AgentReviewResult - mirrors AgentReview (4 agent reviews + lead summary + overallStatus)
  • SingleRunResult - one iteration result (agent_review, elapsed_seconds, iteration, error)
  • AssetTestResult - all runs for one asset + computed scores
  • IVUTestSuite - top-level model for the entire test run
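
A minimal sketch of how the first three models might look, assuming Pydantic v2. The field names ragStatus, feedback, issues, overallStatus, agent_review, elapsed_seconds, iteration, and error come from the list above. Everything else is an assumption for illustration: in particular, `agent_reviews` as a dict keyed by agent name and `lead_summary` are hypothetical, since the real names live in backend/app/models/schemas.py.

```python
from typing import Optional

from pydantic import BaseModel, Field


class SubReviewResult(BaseModel):
    """Mirrors the backend SubReview: one agent's verdict."""
    ragStatus: str
    feedback: str = ""
    issues: list[str] = Field(default_factory=list)


class AgentReviewResult(BaseModel):
    """Mirrors the backend AgentReview.

    Assumption: the four agent reviews are stored in a dict keyed by a
    hypothetical agent name; the backend schema may use named fields instead.
    """
    agent_reviews: dict[str, SubReviewResult] = Field(default_factory=dict)
    lead_summary: str = ""  # hypothetical field name
    overallStatus: str


class SingleRunResult(BaseModel):
    """One iteration for one asset; `error` is set when the run failed."""
    iteration: int
    elapsed_seconds: float
    agent_review: Optional[AgentReviewResult] = None
    error: Optional[str] = None
```

Because the models are plain Pydantic classes, a run can be written to results.json with `model_dump_json()` and read back with `model_validate_json()` for later baseline comparisons.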

Step 2: Configuration (testing/config.py)

argparse-based CLI with settings:

  • --csv (required) - path to CSV file
  • --assets-dir (default: ./assets) - directory containing proof files
  • --backend-url (default: ws://localhost:8000/ws/analyze) - WebSocket endpoint
  • --iterations (default: 5) - runs per asset
  • --output-dir (default: ./output/{timestamp})
  • --anthropic-api-key (from env ANTHROPIC_API_KEY) - for Claude-based scoring
  • --access-token (default: ivu-test-token, intended for a backend started with DISABLE_AUTH=true)
  • --delay-between-runs (default: 5 seconds)
  • --baseline - path to previous results.json for drift comparison
  • --skip-scoring / --skip-pdf flags
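
The settings above could be declared roughly as follows. The flag names and defaults are taken from the list. The `build_parser` function name, the value types, and the help strings are assumptions, and the timestamped output directory is left to be resolved later by the runner.

```python
import argparse
import os


def build_parser():
    """Build the CLI parser for the IVU runner (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="IVU batch testing runner")
    parser.add_argument("--csv", required=True, help="path to CSV file")
    parser.add_argument("--assets-dir", default="./assets",
                        help="directory containing proof files")
    parser.add_argument("--backend-url", default="ws://localhost:8000/ws/analyze",
                        help="WebSocket endpoint")
    parser.add_argument("--iterations", type=int, default=5, help="runs per asset")
    # None means "use ./output/{timestamp}", resolved when the run starts.
    parser.add_argument("--output-dir", default=None)
    parser.add_argument("--anthropic-api-key",
                        default=os.environ.get("ANTHROPIC_API_KEY"),
                        help="key for Claude-based scoring")
    parser.add_argument("--access-token", default="ivu-test-token")
    parser.add_argument("--delay-between-runs", type=float, default=5.0,
                        help="seconds to wait between runs")
    parser.add_argument("--baseline", default=None,
                        help="previous results.json for drift comparison")
    parser.add_argument("--skip-scoring", action="store_true")
    parser.add_argument("--skip-pdf", action="store_true")
    return parser
```

Calling `build_parser().parse_args(["--csv", "sample_assets.csv", "--iterations", "2"])` would then yield a namespace with `iterations == 2` and every other setting at its default.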

Step 3: CSV Parsing & Utilities (testing/utils.py)

CSV columns: campaign_name, brand_guidelines, client_lead, agency, agency_lead, proof_name, channel, sub_channel, proof_type, proof_file_name

Before any analysis starts, the parser validates that every required column is present and that every referenced proof file exists in the assets directory.
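
One way this up-front validation might be written with the standard library is sketched below. The column names come from the list above. The `load_assets` function name and the choice to raise ValueError are assumptions.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = [
    "campaign_name", "brand_guidelines", "client_lead", "agency", "agency_lead",
    "proof_name", "channel", "sub_channel", "proof_type", "proof_file_name",
]


def load_assets(csv_path, assets_dir):
    """Return the CSV rows as dicts, or raise ValueError before any run starts."""
    assets_dir = Path(assets_dir)
    # utf-8-sig tolerates the byte-order mark that spreadsheet exports often add.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            raise ValueError("CSV is missing required columns: " + ", ".join(missing))
        rows = list(reader)

    absent = [
        row["proof_file_name"]
        for row in rows
        if not (assets_dir / row["proof_file_name"]).is_file()
    ]
    if absent:
        raise ValueError("Files not found in assets directory: " + ", ".join(absent))
    return rows
```

Failing fast here avoids discovering a missing file partway through a long batch.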

Step 4: WebSocket Client (testing/ws_client.py)

Async client using websockets library that:

  • Reads file, base64-encodes it, detects MIME type
  • Sends {"type": "analyze", ...} message (intentionally omits campaign_id and proof_name to avoid persisting test runs to the production database - see handlers.py:211)
  • Listens for agent_started, agent_completed, model_fallback, complete, error messages
  • Returns structured result with timing
  • 30s connection timeout, 300s analysis timeout, 3 retries with exponential backoff on connection failures
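
Two pieces of this client need no network and can be sketched in isolation: the file encoding and the retry policy. The helper names `encode_file` and `with_retries` are hypothetical, as is the 1-second base delay. "3 retries" is read here as three further attempts after the first failure, which is an assumption. The message payload itself is left out because its fields are defined by the backend protocol.

```python
import asyncio
import base64
import mimetypes
from pathlib import Path

CONNECT_TIMEOUT_S = 30    # connection timeout from the plan
ANALYSIS_TIMEOUT_S = 300  # analysis timeout from the plan
MAX_RETRIES = 3


def encode_file(path):
    """Return (base64_text, mime_type) for a proof file."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)
    data = base64.b64encode(path.read_bytes()).decode("ascii")
    # Fall back to a generic type when the extension is unknown.
    return data, mime or "application/octet-stream"


async def with_retries(operation, retries=MAX_RETRIES, base_delay=1.0,
                       retry_on=(OSError, asyncio.TimeoutError)):
    """Await operation(); on a connection-style failure, wait and try again.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised once the retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == retries:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```

In the real client, `operation` would be a coroutine function that opens the WebSocket connection, and the two timeouts could be enforced with `asyncio.wait_for`.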

Step 5: AI-Based Scoring (testing/scoring/ai_scorer.py)

Uses Claude (Anthropic API) as an independent judge to evaluate each run's output. This is a stronger IVU approach than using Gemini to score itself - a completely separate AI provider evaluates the analysis quality.

  • Sends the proof image + agent review text to Claude with a detailed rubric
  • Uses anthropic Python SDK with Claude Sonnet for cost-effective scoring
  • Returns structured JSON scores (1-10) for:
    • Quality: Clarity, coherence, professional tone, no hallucinations, grammatical correctness
    • Accuracy: Information matches visible proof content, correct facts/numbers, appropriate RAG status
    • Completeness: All agent perspectives covered, legal/brand/channel aspects addressed, no omissions
  • Requires ANTHROPIC_API_KEY env var (separate from the Gemini key used by the backend)
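
Since the scorer expects structured JSON back, it helps to validate the reply before trusting it. The sketch below covers only that parsing step and makes no API call. The lowercase key names and the `parse_scores` helper are assumptions, since the plan does not fix the response format.

```python
import json

# Assumed JSON keys, one per scored dimension.
SCORE_KEYS = ("quality", "accuracy", "completeness")


def parse_scores(reply_text):
    """Extract and validate 1-10 scores from the judge model's reply text."""
    # Tolerate prose around the JSON by taking the outermost {...} span.
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in scorer reply")
    payload = json.loads(reply_text[start:end + 1])

    scores = {}
    for key in SCORE_KEYS:
        value = payload.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"score '{key}' is missing or not a number")
        if not 1 <= value <= 10:
            raise ValueError(f"score '{key}' is outside the 1-10 range")
        scores[key] = float(value)
    return scores
```

A reply that fails validation can be recorded as a scoring error for that run rather than silently skewing the averages.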

Step 6: Consistency Scoring (testing/scoring/consistency_scorer.py)

Programmatic comparison across iterations (no AI needed):

  • RAG Status Consistency (25%): Unanimous RAG status across runs per agent
  • Overall Status Consistency (25%): Same overallStatus across all runs
  • Issue Overlap (25%): Fuzzy Jaccard similarity of issue lists using difflib.SequenceMatcher
  • Feedback Similarity (25%): Normalized text comparison of feedback across runs
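
The four equally weighted components could be combined along these lines. This is a sketch under stated assumptions: each run is a plain dict with hypothetical keys `overall_status`, `agent_status`, `issues`, and `feedback`; two issues count as the same when their SequenceMatcher ratio reaches 0.8; and list-level scores are averaged over all pairs of runs. None of these choices is specified in the plan.

```python
from difflib import SequenceMatcher
from itertools import combinations


def text_ratio(a, b):
    """Similarity of two strings in [0, 1] after light normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_jaccard(issues_a, issues_b, threshold=0.8):
    """Jaccard similarity where issues match if text_ratio >= threshold."""
    if not issues_a and not issues_b:
        return 1.0
    unmatched_b = list(issues_b)
    matches = 0
    for issue in issues_a:
        best = max(unmatched_b, key=lambda other: text_ratio(issue, other), default=None)
        if best is not None and text_ratio(issue, best) >= threshold:
            unmatched_b.remove(best)  # each issue can be matched only once
            matches += 1
    union = len(issues_a) + len(issues_b) - matches
    return matches / union


def mean_pairwise(items, similarity):
    """Average similarity over every pair of runs (1.0 for a single run)."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def unanimous(values):
    """1.0 if every run produced the same value, else 0.0."""
    return 1.0 if len(set(values)) <= 1 else 0.0


def consistency_score(runs):
    """Weighted 25/25/25/25 consistency score in [0, 1] for one asset."""
    agents = sorted({agent for run in runs for agent in run["agent_status"]})
    rag = (
        sum(unanimous([run["agent_status"].get(a) for run in runs]) for a in agents)
        / len(agents)
        if agents else 1.0
    )
    overall = unanimous([run["overall_status"] for run in runs])
    issues = mean_pairwise([run["issues"] for run in runs], fuzzy_jaccard)
    feedback = mean_pairwise([run["feedback"] for run in runs], text_ratio)
    return 0.25 * rag + 0.25 * overall + 0.25 * issues + 0.25 * feedback
```

Identical runs score 1.0, and a disagreement on overall status alone removes exactly that component's 25%.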

Step 7: Per-Run PDF Reports (testing/reporting/pdf_report.py)

Uses reportlab to generate PDFs matching the frontend's PDFReport.tsx layout:

  • Cover page with Oliver brand, campaign info, iteration number
  • Overall summary with RAG status and lead agent summary
  • 4 agent review sections with RAG badges, feedback text, and key actions lists
  • Brand colors: Navy #001f5a, Blue #00a3e0, RAG colors from PDFReport.tsx

Step 8: Comparative Report (testing/reporting/comparative_report.py)

Generates both HTML and PDF:

  • Executive Summary: Run metadata, aggregate pass/fail rates, average scores
  • Score Dashboard Table: One row per asset with avg Quality, Accuracy, Completeness, Consistency, Response Time
  • Per-Asset Detail: Score breakdown by iteration, RAG distribution, response time stats
  • Model Drift Analysis: If baseline provided, delta per asset per metric, flagged regressions (>1 point drop)
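
The drift check in the last bullet amounts to a per-asset, per-metric subtraction against the baseline. A minimal sketch follows, assuming both result sets have already been reduced to `{asset: {metric: average_score}}` dicts on a shared scale. The `detect_drift` name and that input shape are hypothetical. Response time would need the opposite sign convention, since higher is worse, so it is not handled here.

```python
def detect_drift(baseline, current, threshold=1.0):
    """Compare average scores with a baseline and flag regressions.

    Returns one row per (asset, metric) present in both inputs. A row is a
    regression when the score dropped by more than `threshold` points.
    """
    rows = []
    for asset in sorted(set(baseline) & set(current)):
        shared_metrics = sorted(set(baseline[asset]) & set(current[asset]))
        for metric in shared_metrics:
            delta = current[asset][metric] - baseline[asset][metric]
            rows.append({
                "asset": asset,
                "metric": metric,
                "delta": round(delta, 2),
                "regression": delta < -threshold,
            })
    return rows
```

Assets or metrics that appear in only one of the two result sets are skipped rather than treated as regressions, which keeps the comparison meaningful if the asset list changes between cycles.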

Step 9: Main Runner (testing/ivu_runner.py)

Orchestrates the full flow:

  1. Parse CSV → validate files exist
  2. For each asset (20):
    • For each iteration (5):
      • Run WebSocket analysis → collect result + timing
      • Generate per-run PDF
      • Score with AI scorer (Quality, Accuracy, Completeness)
      • Delay between runs
    • Compute consistency score across all iterations
  3. Save raw results.json (for future baseline comparisons)
  4. Generate comparative report (HTML + PDF)
  5. Handle Ctrl+C gracefully at any point in the run (save partial results)
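
The loop structure and the Ctrl+C handling can be sketched without the WebSocket, scoring, or PDF pieces. In the sketch below, `run_suite` and the injected `analyze` callable are hypothetical stand-ins. The real runner would await the async WebSocket client, and would generate PDFs and scores inside the inner loop.

```python
import json
import time
from pathlib import Path


def run_suite(assets, iterations, analyze, output_dir, delay=0.0):
    """Run every asset `iterations` times and always write results.json."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    results = {"completed": False, "assets": {}}
    try:
        for asset in assets:
            runs = results["assets"].setdefault(asset, [])
            for iteration in range(1, iterations + 1):
                started = time.monotonic()
                try:
                    review, error = analyze(asset), None
                except Exception as exc:  # a failed run is recorded, not fatal
                    review, error = None, str(exc)
                runs.append({
                    "iteration": iteration,
                    "elapsed_seconds": round(time.monotonic() - started, 3),
                    "agent_review": review,
                    "error": error,
                })
                time.sleep(delay)
        results["completed"] = True
    except KeyboardInterrupt:
        # Ctrl+C: stop looping and fall through to save what has been collected.
        pass
    finally:
        (output_dir / "results.json").write_text(json.dumps(results, indent=2))
    return results
```

KeyboardInterrupt does not derive from Exception, so the inner handler records ordinary run failures while Ctrl+C still reaches the outer handler and the partial results are saved.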

Output Structure

output/run_2026-03-03_14-30-00/
├── results.json                    # Machine-readable results (future baseline)
├── comparative_report.html         # Browser-viewable report
├── comparative_report.pdf          # PDF comparative report
├── per_run_pdfs/
│   ├── Asset01_ProofName_run1.pdf
│   ├── Asset01_ProofName_run2.pdf
│   └── ...                         # 20 assets × 5 runs = 100 PDFs
└── logs/
    └── ivu_runner.log              # Execution log

Dependencies (testing/requirements.txt)

websockets>=12.0
anthropic>=0.40.0
pydantic>=2.5.0
reportlab>=4.0

Key Files Referenced

  • backend/app/websocket/handlers.py - WebSocket protocol (lines 42-58 for file handling, 211 for persistence gate)
  • backend/app/models/schemas.py - Canonical SubReview/AgentReview schemas to mirror
  • frontend/components/PDFReport.tsx - PDF layout, brand colors, and structure to replicate
  • frontend/services/geminiService.ts - WebSocket client reference implementation
  • backend/app/services/gemini_service.py - Gemini API client pattern (backend reference only)

Verification

  1. Place test assets in testing/assets/, create a CSV with 2-3 assets
  2. Start backend with DISABLE_AUTH=true
  3. Run: cd testing && python ivu_runner.py --csv sample_assets.csv --iterations 2
  4. Verify per-run PDFs are generated in output directory
  5. Verify results.json contains all scores and timing data
  6. Verify comparative report HTML opens in browser with correct data
  7. Run again with --baseline output/{previous}/results.json to verify drift detection