Add IVU Testing & Performance Monitoring implementation plan

Standalone Python CLI tool plan for Barclays IVU Model compliance testing: batch WebSocket analysis, AI-based scoring via Claude, consistency metrics, per-run PDF reports, and comparative drift detection reports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:28:56 +00:00 · 2026-03-03 13:28:56 +00:00 · d7fd435210
commit d7fd435210
parent 0432635153
1 changed file with 152 additions and 0 deletions
--- a/IVU_TESTING_PLAN.md
+++ b/IVU_TESTING_PLAN.md
@@ -0,0 +1,152 @@
+# IVU Testing & Performance Monitoring Script
+
+## Context
+
+Barclays IVU (Independent Validation Unit) Model compliance requires a structured methodology for testing and reporting on AI model performance. Per Section 7.1 of the requirements, the process involves ingesting 20 predefined assets, running each one 5 times (100 analyses in total), and comparing the resulting reports across Quality, Consistency, Accuracy, Completeness, and Response Time. The process runs every 6 months to detect model drift and to benchmark new results against past responses.
+
+This is a standalone Python CLI tool that connects to the running ModComms backend via WebSocket, runs batch analyses, scores results, and generates comparative reports. It does NOT live inside the site itself.
+
+## Directory Structure
+
+```
+testing/
+├── ivu_runner.py              # Main CLI entry point
+├── config.py                  # CLI args + env var configuration
+├── models.py                  # Pydantic data models for IVU results
+├── utils.py                   # CSV parsing, file I/O, MIME detection
+├── ws_client.py               # Async WebSocket client
+├── scoring/
+│   ├── __init__.py
+│   ├── ai_scorer.py           # Claude-based Quality/Accuracy/Completeness scorer (Anthropic API)
+│   └── consistency_scorer.py  # Programmatic consistency comparison across runs
+├── reporting/
+│   ├── __init__.py
+│   ├── pdf_report.py          # Per-run PDF report (reportlab)
+│   ├── comparative_report.py  # Final HTML + PDF comparative summary
+│   └── styles.py              # Shared brand colors and styling constants
+├── assets/                    # Where the 20 test proof files go
+│   └── .gitkeep
+├── output/                    # Default output directory (gitignored)
+│   └── .gitkeep
+├── requirements.txt           # Standalone deps
+└── sample_assets.csv          # Example CSV template
+```
+
+## Implementation Steps
+
+### Step 1: Data Models (`testing/models.py`)
+Define Pydantic models mirroring backend schemas from `backend/app/models/schemas.py`:
+- `SubReviewResult` - mirrors `SubReview` (ragStatus, feedback, issues)
+- `AgentReviewResult` - mirrors `AgentReview` (4 agent reviews + lead summary + overallStatus)
+- `SingleRunResult` - one iteration result (agent_review, elapsed_seconds, iteration, error)
+- `AssetTestResult` - all runs for one asset + computed scores
+- `IVUTestSuite` - top-level model for the entire test run
+
+### Step 2: Configuration (`testing/config.py`)
+argparse-based CLI with settings:
+- `--csv` (required) - path to CSV file
+- `--assets-dir` (default: `./assets`) - directory containing proof files
+- `--backend-url` (default: `ws://localhost:8000/ws/analyze`) - WebSocket endpoint
+- `--iterations` (default: `5`) - runs per asset
+- `--output-dir` (default: `./output/{timestamp}`)
+- `--anthropic-api-key` (from env `ANTHROPIC_API_KEY`) - for Claude-based scoring
+- `--access-token` (default: `ivu-test-token`, a placeholder for use when the backend runs with `DISABLE_AUTH=true`)
+- `--delay-between-runs` (default: `5` seconds)
+- `--baseline` - path to previous `results.json` for drift comparison
+- `--skip-scoring` / `--skip-pdf` flags
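
The flag list above could be wired up roughly as follows. This is a minimal sketch, not the planned `config.py`: the function names `build_parser` and `parse_args`, the numeric types, and the timestamp format are assumptions made for illustration; only the flag names and defaults come from the list.

```python
import argparse
import os
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the Step 2 flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="ivu_runner")
    parser.add_argument("--csv", required=True, help="Path to the asset CSV")
    parser.add_argument("--assets-dir", default="./assets")
    parser.add_argument("--backend-url", default="ws://localhost:8000/ws/analyze")
    parser.add_argument("--iterations", type=int, default=5)
    parser.add_argument("--output-dir", default=None,
                        help="Defaults to a timestamped folder under ./output")
    # Falls back to the environment so the key never has to appear on the CLI.
    parser.add_argument("--anthropic-api-key",
                        default=os.environ.get("ANTHROPIC_API_KEY"))
    parser.add_argument("--access-token", default="ivu-test-token")
    parser.add_argument("--delay-between-runs", type=float, default=5.0)
    parser.add_argument("--baseline", default=None,
                        help="Previous results.json for drift comparison")
    parser.add_argument("--skip-scoring", action="store_true")
    parser.add_argument("--skip-pdf", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and fill in the timestamped output directory if unset."""
    args = build_parser().parse_args(argv)
    if args.output_dir is None:
        # Timestamp format is an assumption, not specified in the plan.
        stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        args.output_dir = os.path.join("output", stamp)
    return args
```

Resolving the output directory after parsing, rather than in the `default=`, keeps the timestamp tied to the moment the run starts.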
+
+### Step 3: CSV Parsing & Utilities (`testing/utils.py`)
+CSV columns: `campaign_name`, `brand_guidelines`, `client_lead`, `agency`, `agency_lead`, `proof_name`, `channel`, `sub_channel`, `proof_type`, `proof_file_name`
+
+Before any run starts, the tool validates that all required columns are present and that every file referenced in `proof_file_name` exists in the assets directory.
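
One way to perform this up-front validation, using only the standard library, is sketched below. The function name `load_assets` and the choice to raise `ValueError` are assumptions for illustration; the column names are the ones listed in this step.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = [
    "campaign_name", "brand_guidelines", "client_lead", "agency", "agency_lead",
    "proof_name", "channel", "sub_channel", "proof_type", "proof_file_name",
]


def load_assets(csv_path, assets_dir):
    """Return CSV rows as dicts, raising ValueError on missing columns or files."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing_columns = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing_columns:
            raise ValueError(f"Missing CSV columns: {missing_columns}")
        rows = list(reader)

    # Check every referenced proof file before any analysis is started.
    missing_files = [
        row["proof_file_name"] for row in rows
        if not (Path(assets_dir) / row["proof_file_name"]).is_file()
    ]
    if missing_files:
        raise ValueError(f"Files not found in {assets_dir}: {missing_files}")
    return rows
```

Collecting all missing columns or files into one error, instead of failing on the first, lets a tester fix the CSV in a single pass.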
+
+### Step 4: WebSocket Client (`testing/ws_client.py`)
+Async client using `websockets` library that:
+- Reads file, base64-encodes it, detects MIME type
+- Sends `{"type": "analyze", ...}` message (intentionally omits `campaign_id` and `proof_name` to avoid persisting test runs to the production database - see `handlers.py:211`)
+- Listens for `agent_started`, `agent_completed`, `model_fallback`, `complete`, `error` messages
+- Returns structured result with timing
+- Applies a 30s connection timeout and a 300s analysis timeout, with 3 retries using exponential backoff on connection failures
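
The timeout and retry behaviour could look something like the sketch below. It uses only `asyncio`; the helper names `with_retries` and `connect_with_timeout`, the 1-second base delay, and the reading of "3 retries" as three further attempts after the first failure are all assumptions, not details from the plan.

```python
import asyncio


async def with_retries(operation, retries=3, base_delay=1.0,
                       retry_on=(OSError, asyncio.TimeoutError)):
    """Await operation(); on a listed error, retry up to `retries` more times.

    The wait doubles after each failure: base_delay, 2*base_delay, 4*base_delay.
    """
    for attempt in range(retries + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == retries:
                raise  # retries exhausted, surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def connect_with_timeout(connect, timeout=30.0):
    """Bound a single connection attempt by the connection timeout."""
    return await asyncio.wait_for(connect(), timeout=timeout)
```

Here `operation` and `connect` are zero-argument callables returning an awaitable, so each retry creates a fresh attempt. Only connection-type errors are retried; a failure during the analysis itself would still propagate to the caller.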
+
+### Step 5: AI-Based Scoring (`testing/scoring/ai_scorer.py`)
+Uses **Claude (Anthropic API)** as an independent judge to evaluate each run's output. This gives stronger independent validation than having Gemini score its own output, because a completely separate AI provider evaluates the analysis quality.
+- Sends the proof image + agent review text to Claude with a detailed rubric
+- Uses `anthropic` Python SDK with Claude Sonnet for cost-effective scoring
+- Returns structured JSON scores (1-10) for:
+  - **Quality**: Clarity, coherence, professional tone, no hallucinations, grammatical correctness
+  - **Accuracy**: Information matches visible proof content, correct facts/numbers, appropriate RAG status
+  - **Completeness**: All agent perspectives covered, legal/brand/channel aspects addressed, no omissions
+- Requires `ANTHROPIC_API_KEY` env var (separate from the Gemini key used by the backend)
+
+### Step 6: Consistency Scoring (`testing/scoring/consistency_scorer.py`)
+Programmatic comparison across iterations (no AI needed):
+- **RAG Status Consistency (25%)**: Unanimous RAG status across runs per agent
+- **Overall Status Consistency (25%)**: Same overallStatus across all runs
+- **Issue Overlap (25%)**: Fuzzy Jaccard similarity of issue lists using `difflib.SequenceMatcher`
+- **Feedback Similarity (25%)**: Normalized text comparison of feedback across runs
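
The four equally weighted components could be combined as in the sketch below. Several details are assumptions rather than part of the plan: the shape of each run (a dict with `agent_statuses`, `overall_status`, `issues`, `feedback`), the 0.8 match threshold, scoring RAG consistency as the fraction of agents with a unanimous status, and averaging the similarity measures over all pairs of runs.

```python
from difflib import SequenceMatcher
from itertools import combinations


def text_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after lowercasing and collapsing whitespace."""
    norm_a, norm_b = " ".join(a.lower().split()), " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def fuzzy_jaccard(issues_a, issues_b, threshold=0.8):
    """Jaccard index where two issues match if their similarity >= threshold."""
    if not issues_a and not issues_b:
        return 1.0
    unmatched_b = list(issues_b)
    matches = 0
    for issue in issues_a:
        # Greedily pair each issue with its closest still-unmatched counterpart.
        best = max(unmatched_b, key=lambda other: text_similarity(issue, other),
                   default=None)
        if best is not None and text_similarity(issue, best) >= threshold:
            unmatched_b.remove(best)
            matches += 1
    return matches / (len(issues_a) + len(issues_b) - matches)


def mean_pairwise(items, metric):
    """Average a pairwise metric over every pair of runs (1.0 for a single run)."""
    pairs = list(combinations(items, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0


def consistency_score(runs):
    """Combine the four 25% components into one score in [0, 1]."""
    agents = runs[0]["agent_statuses"].keys()
    unanimous = [len({r["agent_statuses"][a] for r in runs}) == 1 for a in agents]
    rag = sum(unanimous) / len(unanimous) if unanimous else 1.0
    overall = 1.0 if len({r["overall_status"] for r in runs}) == 1 else 0.0
    issues = mean_pairwise([r["issues"] for r in runs], fuzzy_jaccard)
    feedback = mean_pairwise([r["feedback"] for r in runs], text_similarity)
    return 0.25 * rag + 0.25 * overall + 0.25 * issues + 0.25 * feedback
```

The greedy matching in `fuzzy_jaccard` is a simplification; an optimal pairing would need an assignment algorithm, which is probably unnecessary for short issue lists.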
+
+### Step 7: Per-Run PDF Reports (`testing/reporting/pdf_report.py`)
+Uses `reportlab` to generate PDFs matching the frontend's `PDFReport.tsx` layout:
+- Cover page with Oliver brand, campaign info, iteration number
+- Overall summary with RAG status and lead agent summary
+- 4 agent review sections with RAG badges, feedback text, and key actions lists
+- Brand colors: Navy `#001f5a`, Blue `#00a3e0`, RAG colors from PDFReport.tsx
+
+### Step 8: Comparative Report (`testing/reporting/comparative_report.py`)
+Generates both HTML and PDF:
+- **Executive Summary**: Run metadata, aggregate pass/fail rates, average scores
+- **Score Dashboard Table**: One row per asset with avg Quality, Accuracy, Completeness, Consistency, Response Time
+- **Per-Asset Detail**: Score breakdown by iteration, RAG distribution, response time stats
+- **Model Drift Analysis**: If baseline provided, delta per asset per metric, flagged regressions (>1 point drop)
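
The drift check in the last bullet amounts to a per-asset, per-metric subtraction against the baseline. The sketch below assumes a simplified, hypothetical data shape (asset name mapped to a dict of average scores); the real `results.json` layout is not defined in this plan, and the function name `detect_drift` is illustrative.

```python
def detect_drift(baseline, current, threshold=1.0):
    """Return per-asset, per-metric deltas and the list of flagged regressions.

    Both arguments are hypothetical mappings: asset name -> {metric: avg score}.
    """
    deltas, regressions = {}, []
    for asset, metrics in current.items():
        previous = baseline.get(asset)
        if previous is None:
            continue  # asset not in the baseline, nothing to compare
        deltas[asset] = {}
        for metric, score in metrics.items():
            if metric not in previous:
                continue
            delta = score - previous[metric]
            deltas[asset][metric] = delta
            if delta < -threshold:  # a drop of more than `threshold` points
                regressions.append((asset, metric, delta))
    return deltas, regressions
```

This treats higher as better, which fits the 1-10 scores. Response time would need the opposite sign convention and its own threshold, since an increase there is the regression.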
+
+### Step 9: Main Runner (`testing/ivu_runner.py`)
+Orchestrates the full flow:
+1. Parse CSV → validate files exist
+2. For each asset (20):
+   - For each iteration (5):
+     - Run WebSocket analysis → collect result + timing
+     - Generate per-run PDF
+     - Score with AI scorer (Quality, Accuracy, Completeness)
+     - Delay between runs
+   - Compute consistency score across all iterations
+3. Save raw `results.json` (for future baseline comparisons)
+4. Generate comparative report (HTML + PDF)
+5. Handle Ctrl+C gracefully (save partial results)
+
+## Output Structure
+```
+output/run_2026-03-03_14-30-00/
+├── results.json                    # Machine-readable results (future baseline)
+├── comparative_report.html         # Browser-viewable report
+├── comparative_report.pdf          # PDF comparative report
+├── per_run_pdfs/
+│   ├── Asset01_ProofName_run1.pdf
+│   ├── Asset01_ProofName_run2.pdf
+│   └── ...                         # 20 assets × 5 runs = 100 PDFs
+└── logs/
+    └── ivu_runner.log              # Execution log
+```
+
+## Dependencies (`testing/requirements.txt`)
+```
+websockets>=12.0
+anthropic>=0.40.0
+pydantic>=2.5.0
+reportlab>=4.0
+```
+
+## Key Files Referenced
+- `backend/app/websocket/handlers.py` - WebSocket protocol (lines 42-58 for file handling, 211 for persistence gate)
+- `backend/app/models/schemas.py` - Canonical SubReview/AgentReview schemas to mirror
+- `frontend/components/PDFReport.tsx` - PDF layout, brand colors, and structure to replicate
+- `frontend/services/geminiService.ts` - WebSocket client reference implementation
+- `backend/app/services/gemini_service.py` - Gemini API client pattern (backend reference only)
+
+## Verification
+1. Place test assets in `testing/assets/`, create a CSV with 2-3 assets
+2. Start backend with `DISABLE_AUTH=true`
+3. Run: `cd testing && python ivu_runner.py --csv sample_assets.csv --iterations 2`
+4. Verify per-run PDFs are generated in output directory
+5. Verify `results.json` contains all scores and timing data
+6. Verify comparative report HTML opens in browser with correct data
+7. Run again with `--baseline output/{previous}/results.json` to verify drift detection