modcomms/IVU_TESTING_PLAN.md
Vadym Samoilenko d7fd435210 Add IVU Testing & Performance Monitoring implementation plan
Standalone Python CLI tool plan for Barclays IVU Model compliance testing:
batch WebSocket analysis, AI-based scoring via Claude, consistency metrics,
per-run PDF reports, and comparative drift detection reports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:28:56 +00:00

# IVU Testing & Performance Monitoring Script
## Context
Barclays IVU (Independent Validation Unit) Model compliance requires a structured methodology for testing and reporting on AI model performance. Per Section 7.1 of the requirements, the process involves ingesting 20 predefined assets, running each of them 5 times, and comparing the resulting reports across Quality, Consistency, Accuracy, Completeness, and Response Time. The process repeats every 6 months to detect model drift and to benchmark against past responses.
This is a standalone Python CLI tool that connects to the running ModComms backend via WebSocket, runs batch analyses, scores results, and generates comparative reports. It does NOT live inside the site itself.
## Directory Structure
```
testing/
├── ivu_runner.py # Main CLI entry point
├── config.py # CLI args + env var configuration
├── models.py # Pydantic data models for IVU results
├── utils.py # CSV parsing, file I/O, MIME detection
├── ws_client.py # Async WebSocket client
├── scoring/
│ ├── __init__.py
│ ├── ai_scorer.py # Claude-based Quality/Accuracy/Completeness scorer (Anthropic API)
│ └── consistency_scorer.py # Programmatic consistency comparison across runs
├── reporting/
│ ├── __init__.py
│ ├── pdf_report.py # Per-run PDF report (reportlab)
│ ├── comparative_report.py # Final HTML + PDF comparative summary
│ └── styles.py # Shared brand colors and styling constants
├── assets/ # Where the 20 test proof files go
│ └── .gitkeep
├── output/ # Default output directory (gitignored)
│ └── .gitkeep
├── requirements.txt # Standalone deps
└── sample_assets.csv # Example CSV template
```
## Implementation Steps
### Step 1: Data Models (`testing/models.py`)
Define Pydantic models mirroring backend schemas from `backend/app/models/schemas.py`:
- `SubReviewResult` - mirrors `SubReview` (ragStatus, feedback, issues)
- `AgentReviewResult` - mirrors `AgentReview` (4 agent reviews + lead summary + overallStatus)
- `SingleRunResult` - one iteration result (agent_review, elapsed_seconds, iteration, error)
- `AssetTestResult` - all runs for one asset + computed scores
- `IVUTestSuite` - top-level model for the entire test run
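The model hierarchy above could look roughly like the following sketch. Only `ragStatus`, `feedback`, `issues`, `overallStatus`, `agent_review`, `elapsed_seconds`, `iteration`, and `error` come from this plan; every other field name (`agentReviews`, `leadSummary`, `proof_name`, `runs`, `scores`, `started_at`, `assets`) is a placeholder, and the real shapes should be copied from `backend/app/models/schemas.py`.

```python
from typing import Optional

from pydantic import BaseModel, Field


class SubReviewResult(BaseModel):
    ragStatus: str
    feedback: str
    issues: list[str] = Field(default_factory=list)


class AgentReviewResult(BaseModel):
    # Assumption: the four agent reviews are keyed by agent name here;
    # the backend schema may instead use one named field per agent.
    agentReviews: dict[str, SubReviewResult]
    leadSummary: str  # hypothetical field name
    overallStatus: str


class SingleRunResult(BaseModel):
    iteration: int
    elapsed_seconds: float
    agent_review: Optional[AgentReviewResult] = None  # None if the run failed
    error: Optional[str] = None


class AssetTestResult(BaseModel):
    proof_name: str  # hypothetical identifier for the asset
    runs: list[SingleRunResult] = Field(default_factory=list)
    scores: dict[str, float] = Field(default_factory=dict)


class IVUTestSuite(BaseModel):
    started_at: str  # hypothetical run metadata
    assets: list[AssetTestResult] = Field(default_factory=list)
```

Keeping `agent_review` and `error` both optional lets a failed iteration be stored alongside successful ones, so `results.json` stays complete even when individual runs fail.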
### Step 2: Configuration (`testing/config.py`)
argparse-based CLI with settings:
- `--csv` (required) - path to CSV file
- `--assets-dir` (default: `./assets`) - directory containing proof files
- `--backend-url` (default: `ws://localhost:8000/ws/analyze`) - WebSocket endpoint
- `--iterations` (default: `5`) - runs per asset
- `--output-dir` (default: `./output/{timestamp}`)
- `--anthropic-api-key` (from env `ANTHROPIC_API_KEY`) - for Claude-based scoring
- `--access-token` (default: `ivu-test-token`, intended for a backend running with `DISABLE_AUTH=true`)
- `--delay-between-runs` (default: `5` seconds)
- `--baseline` - path to previous `results.json` for drift comparison
- `--skip-scoring` / `--skip-pdf` flags
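A minimal sketch of how these settings might be declared with `argparse`. The flag names and defaults are taken from the list above; the function names `build_parser` and `parse_args` and the timestamp format are assumptions.

```python
import argparse
import os
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="IVU testing runner")
    parser.add_argument("--csv", required=True, help="path to CSV file")
    parser.add_argument("--assets-dir", default="./assets")
    parser.add_argument("--backend-url", default="ws://localhost:8000/ws/analyze")
    parser.add_argument("--iterations", type=int, default=5)
    parser.add_argument("--output-dir", default=None)
    parser.add_argument(
        "--anthropic-api-key", default=os.environ.get("ANTHROPIC_API_KEY")
    )
    parser.add_argument("--access-token", default="ivu-test-token")
    parser.add_argument("--delay-between-runs", type=float, default=5.0)
    parser.add_argument("--baseline", default=None)
    parser.add_argument("--skip-scoring", action="store_true")
    parser.add_argument("--skip-pdf", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.output_dir is None:
        # Assumed timestamp format; resolved at parse time so each run
        # gets its own directory.
        stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        args.output_dir = os.path.join("output", stamp)
    return args
```

Resolving the timestamped default inside `parse_args` rather than in `add_argument` keeps the parser itself free of clock-dependent state.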
### Step 3: CSV Parsing & Utilities (`testing/utils.py`)
CSV columns: `campaign_name`, `brand_guidelines`, `client_lead`, `agency`, `agency_lead`, `proof_name`, `channel`, `sub_channel`, `proof_type`, `proof_file_name`
Before any run starts, the parser validates that all required columns are present and that every referenced file exists in the assets directory.
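The up-front validation could be sketched as follows. The column names come from the list above; the function name `load_assets` and the choice of exceptions are assumptions.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = [
    "campaign_name", "brand_guidelines", "client_lead", "agency",
    "agency_lead", "proof_name", "channel", "sub_channel",
    "proof_type", "proof_file_name",
]


def load_assets(csv_path: str, assets_dir: str) -> list[dict]:
    """Read the asset CSV and fail fast on missing columns or files."""
    # utf-8-sig tolerates the byte-order mark that spreadsheet exports often add.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing_columns = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing_columns:
            raise ValueError(f"CSV is missing columns: {missing_columns}")
        rows = list(reader)

    missing_files = [
        row["proof_file_name"]
        for row in rows
        if not (Path(assets_dir) / row["proof_file_name"]).is_file()
    ]
    if missing_files:
        raise FileNotFoundError(f"Not found in {assets_dir}: {missing_files}")
    return rows
```

Collecting every missing file before raising means a single failed start reports all problems at once, which matters when a full run takes 100 analyses.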
### Step 4: WebSocket Client (`testing/ws_client.py`)
Async client using `websockets` library that:
- Reads file, base64-encodes it, detects MIME type
- Sends `{"type": "analyze", ...}` message (intentionally omits `campaign_id` and `proof_name` to avoid persisting test runs to the production database - see `handlers.py:211`)
- Listens for `agent_started`, `agent_completed`, `model_fallback`, `complete`, `error` messages
- Returns structured result with timing
- 30s connection timeout, 300s analysis timeout, 3 retries with exponential backoff on connection failures
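The message construction and retry behaviour described above might be sketched like this, using only the standard library. The payload keys `file_name`, `mime_type`, and `file_data` are hypothetical; the actual keys must match what `handlers.py` expects. The helper `with_retries` takes any async callable, so the real `websockets` connection call can be passed in.

```python
import asyncio
import base64
import json
import mimetypes
from pathlib import Path


def build_analyze_message(file_path: str) -> str:
    """Encode a proof file as an `analyze` message (JSON string)."""
    path = Path(file_path)
    mime_type, _ = mimetypes.guess_type(path.name)
    payload = {
        "type": "analyze",
        # Hypothetical key names; campaign_id and proof_name are
        # deliberately left out so test runs are not persisted.
        "file_name": path.name,
        "mime_type": mime_type or "application/octet-stream",
        "file_data": base64.b64encode(path.read_bytes()).decode("ascii"),
    }
    return json.dumps(payload)


async def with_retries(operation, retries: int = 3, base_delay: float = 1.0):
    """Await `operation()`, retrying connection failures with exponential backoff.

    One initial attempt plus up to `retries` retries; the wait doubles
    after each failure (base_delay, 2*base_delay, 4*base_delay, ...).
    """
    for attempt in range(retries + 1):
        try:
            return await operation()
        except (OSError, asyncio.TimeoutError):
            if attempt == retries:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```

The 30s connection and 300s analysis timeouts would then wrap the individual awaits, for example with `asyncio.wait_for`.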
### Step 5: AI-Based Scoring (`testing/scoring/ai_scorer.py`)
Uses **Claude (Anthropic API)** as an independent judge to evaluate each run's output. This gives stronger independence than having Gemini score its own output, because a completely separate AI provider evaluates the analysis quality.
- Sends the proof image + agent review text to Claude with a detailed rubric
- Uses `anthropic` Python SDK with Claude Sonnet for cost-effective scoring
- Returns structured JSON scores (1-10) for:
  - **Quality**: Clarity, coherence, professional tone, no hallucinations, grammatical correctness
  - **Accuracy**: Information matches visible proof content, correct facts/numbers, appropriate RAG status
  - **Completeness**: All agent perspectives covered, legal/brand/channel aspects addressed, no omissions
- Requires `ANTHROPIC_API_KEY` env var (separate from the Gemini key used by the backend)
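Because the scores arrive as model-generated JSON, the scorer will likely need to validate the reply before trusting it. A minimal sketch, assuming the reply contains a single JSON object with integer `quality`, `accuracy`, and `completeness` keys (the key names and the function `parse_scores` are assumptions; the plan only fixes the three dimensions and the 1-10 range):

```python
import json

DIMENSIONS = ("quality", "accuracy", "completeness")


def parse_scores(reply_text: str) -> dict[str, int]:
    """Extract and validate the 1-10 scores from the scorer's reply text."""
    # Tolerate prose around the JSON by slicing from the first "{" to the last "}".
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in scorer reply")
    data = json.loads(reply_text[start:end + 1])

    scores = {}
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
        scores[name] = value
    return scores
```

Raising on a malformed reply, rather than defaulting to a score, keeps bad scorer output from silently skewing the averages in the comparative report.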
### Step 6: Consistency Scoring (`testing/scoring/consistency_scorer.py`)
Programmatic comparison across iterations (no AI needed):
- **RAG Status Consistency (25%)**: Unanimous RAG status across runs per agent
- **Overall Status Consistency (25%)**: Same overallStatus across all runs
- **Issue Overlap (25%)**: Fuzzy Jaccard similarity of issue lists using `difflib.SequenceMatcher`
- **Feedback Similarity (25%)**: Normalized text comparison of feedback across runs
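One way the four equally weighted components could be combined is sketched below. The run shape (`{"overallStatus": ..., "agents": {name: {"ragStatus", "feedback", "issues"}}}`), the 0.8 match threshold, and the pairwise averaging are all assumptions; the plan only fixes the four components, their weights, and the use of `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from itertools import combinations


def _ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_jaccard(a: list[str], b: list[str], threshold: float = 0.8) -> float:
    """Jaccard similarity where two issues match if their text ratio >= threshold."""
    if not a and not b:
        return 1.0
    remaining = list(b)
    matched = 0
    for item in a:
        best = max(remaining, key=lambda other: _ratio(item, other), default=None)
        if best is not None and _ratio(item, best) >= threshold:
            remaining.remove(best)  # each issue in b can match at most once
            matched += 1
    return matched / (len(a) + len(b) - matched)


def _mean(values) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 1.0


def consistency_score(runs: list[dict]) -> float:
    """Return a 0-1 consistency score across all iterations of one asset."""
    agents = sorted(runs[0]["agents"])
    pairs = list(combinations(runs, 2))
    rag = _mean(
        1.0 if len({r["agents"][a]["ragStatus"] for r in runs}) == 1 else 0.0
        for a in agents
    )
    overall = 1.0 if len({r["overallStatus"] for r in runs}) == 1 else 0.0
    issues = _mean(
        fuzzy_jaccard(x["agents"][a]["issues"], y["agents"][a]["issues"])
        for x, y in pairs for a in agents
    )
    feedback = _mean(
        _ratio(x["agents"][a]["feedback"], y["agents"][a]["feedback"])
        for x, y in pairs for a in agents
    )
    return 0.25 * (rag + overall + issues + feedback)
```

Comparing every pair of runs costs 10 comparisons per agent at 5 iterations, which is negligible next to the analysis time itself.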
### Step 7: Per-Run PDF Reports (`testing/reporting/pdf_report.py`)
Uses `reportlab` to generate PDFs matching the frontend's `PDFReport.tsx` layout:
- Cover page with Oliver brand, campaign info, iteration number
- Overall summary with RAG status and lead agent summary
- 4 agent review sections with RAG badges, feedback text, and key actions lists
- Brand colors: Navy `#001f5a`, Blue `#00a3e0`, RAG colors from PDFReport.tsx
### Step 8: Comparative Report (`testing/reporting/comparative_report.py`)
Generates both HTML and PDF:
- **Executive Summary**: Run metadata, aggregate pass/fail rates, average scores
- **Score Dashboard Table**: One row per asset with avg Quality, Accuracy, Completeness, Consistency, Response Time
- **Per-Asset Detail**: Score breakdown by iteration, RAG distribution, response time stats
- **Model Drift Analysis**: If baseline provided, delta per asset per metric, flagged regressions (>1 point drop)
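The drift check against a baseline could be sketched as a per-asset, per-metric delta with a threshold. The input shape (asset name mapped to average scores) and the function name `find_regressions` are assumptions; the ">1 point drop" rule is from the list above and is applied here only to the 1-10 score metrics, not to response time.

```python
def find_regressions(
    baseline: dict[str, dict[str, float]],
    current: dict[str, dict[str, float]],
    threshold: float = 1.0,
) -> list[tuple[str, str, float]]:
    """Return (asset, metric, delta) for every score that dropped by more than threshold."""
    flagged = []
    for asset, metrics in current.items():
        for metric, value in metrics.items():
            previous = baseline.get(asset, {}).get(metric)
            if previous is None:
                continue  # asset or metric is new, nothing to compare against
            delta = value - previous
            if delta < -threshold:
                flagged.append((asset, metric, round(delta, 2)))
    return flagged
```

A drop of exactly one point is not flagged under this reading of ">1 point"; whether the boundary should be inclusive is a decision for the requirements owner.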
### Step 9: Main Runner (`testing/ivu_runner.py`)
Orchestrates the full flow:
1. Parse CSV → validate files exist
2. For each asset (20):
   - For each iteration (5):
     - Run WebSocket analysis → collect result + timing
     - Generate per-run PDF
     - Score with AI scorer (Quality, Accuracy, Completeness)
     - Delay between runs
   - Compute consistency score across all iterations
3. Save raw `results.json` (for future baseline comparisons)
4. Generate comparative report (HTML + PDF)
5. Handle Ctrl+C gracefully (save partial results)
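The loop and the graceful Ctrl+C handling might be structured as in the sketch below. It is simplified: `analyze` is a synchronous stand-in for the async WebSocket call, PDF generation and scoring are omitted, and the `results` layout and the name `run_suite` are assumptions.

```python
import json
import time
from pathlib import Path


def run_suite(assets, iterations, analyze, output_dir, delay=0.0):
    """Run every asset `iterations` times and always write results.json."""
    results = {"completed": False, "assets": {}}
    try:
        for asset in assets:
            runs = results["assets"].setdefault(asset, [])
            for iteration in range(1, iterations + 1):
                started = time.monotonic()
                review = analyze(asset)
                runs.append({
                    "iteration": iteration,
                    "elapsed_seconds": time.monotonic() - started,
                    "agent_review": review,
                })
                if delay:
                    time.sleep(delay)
        results["completed"] = True
    except KeyboardInterrupt:
        # Ctrl+C: stop scheduling new runs but keep what has been collected.
        print("Interrupted, saving partial results")
    finally:
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "results.json").write_text(json.dumps(results, indent=2))
    return results
```

Writing `results.json` in the `finally` block means both a clean finish and an interrupted run leave a usable file, and the `completed` flag tells a later baseline comparison whether the data is partial.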
## Output Structure
```
output/run_2026-03-03_14-30-00/
├── results.json # Machine-readable results (future baseline)
├── comparative_report.html # Browser-viewable report
├── comparative_report.pdf # PDF comparative report
├── per_run_pdfs/
│ ├── Asset01_ProofName_run1.pdf
│ ├── Asset01_ProofName_run2.pdf
│ └── ... # 20 assets × 5 runs = 100 PDFs
└── logs/
  └── ivu_runner.log # Execution log
```
## Dependencies (`testing/requirements.txt`)
```
websockets>=12.0
anthropic>=0.40.0
pydantic>=2.5.0
reportlab>=4.0
```
## Key Files Referenced
- `backend/app/websocket/handlers.py` - WebSocket protocol (lines 42-58 for file handling, 211 for persistence gate)
- `backend/app/models/schemas.py` - Canonical SubReview/AgentReview schemas to mirror
- `frontend/components/PDFReport.tsx` - PDF layout, brand colors, and structure to replicate
- `frontend/services/geminiService.ts` - WebSocket client reference implementation
- `backend/app/services/gemini_service.py` - Gemini API client pattern (backend reference only)
## Verification
1. Place test assets in `testing/assets/`, create a CSV with 2-3 assets
2. Start backend with `DISABLE_AUTH=true`
3. Run: `cd testing && python ivu_runner.py --csv sample_assets.csv --iterations 2`
4. Verify per-run PDFs are generated in output directory
5. Verify `results.json` contains all scores and timing data
6. Verify comparative report HTML opens in browser with correct data
7. Run again with `--baseline output/{previous}/results.json` to verify drift detection