diff --git a/IVU_TESTING_PLAN.md b/IVU_TESTING_PLAN.md
new file mode 100644
index 0000000..723e6a5
--- /dev/null
+++ b/IVU_TESTING_PLAN.md
@@ -0,0 +1,152 @@
+# IVU Testing & Performance Monitoring Script
+
+## Context
+
+Barclays IVU (Independent Validation Unit) Model compliance requires a structured methodology for testing and reporting on AI model performance. Per Section 7.1 of the requirements, the process involves ingesting 20 predefined assets, running them 5 times each, and comparing reports across Quality, Consistency, Accuracy, Completeness, and Response Time. This runs every 6 months to detect model drift and benchmark past responses.
+
+This is a standalone Python CLI tool that connects to the running ModComms backend via WebSocket, runs batch analyses, scores results, and generates comparative reports. It does NOT live inside the site itself.
+
+## Directory Structure
+
+```
+testing/
+├── ivu_runner.py              # Main CLI entry point
+├── config.py                  # CLI args + env var configuration
+├── models.py                  # Pydantic data models for IVU results
+├── utils.py                   # CSV parsing, file I/O, MIME detection
+├── ws_client.py               # Async WebSocket client
+├── scoring/
+│   ├── __init__.py
+│   ├── ai_scorer.py           # Claude-based Quality/Accuracy/Completeness scorer (Anthropic API)
+│   └── consistency_scorer.py  # Programmatic consistency comparison across runs
+├── reporting/
+│   ├── __init__.py
+│   ├── pdf_report.py          # Per-run PDF report (reportlab)
+│   ├── comparative_report.py  # Final HTML + PDF comparative summary
+│   └── styles.py              # Shared brand colors and styling constants
+├── assets/                    # Where the 20 test proof files go
+│   └── .gitkeep
+├── output/                    # Default output directory (gitignored)
+│   └── .gitkeep
+├── requirements.txt           # Standalone deps
+└── sample_assets.csv          # Example CSV template
+```
+
+## Implementation Steps
+
+### Step 1: Data Models (`testing/models.py`)
+Define Pydantic models mirroring backend schemas from `backend/app/models/schemas.py`:
+- `SubReviewResult` - mirrors `SubReview` (ragStatus, feedback, issues)
+- `AgentReviewResult` - mirrors `AgentReview` (4 agent reviews + lead summary + overallStatus)
+- `SingleRunResult` - one iteration result (agent_review, elapsed_seconds, iteration, error)
+- `AssetTestResult` - all runs for one asset + computed scores
+- `IVUTestSuite` - top-level model for the entire test run
+
+### Step 2: Configuration (`testing/config.py`)
+argparse-based CLI with settings:
+- `--csv` (required) - path to CSV file
+- `--assets-dir` (default: `./assets`) - directory containing proof files
+- `--backend-url` (default: `ws://localhost:8000/ws/analyze`) - WebSocket endpoint
+- `--iterations` (default: `5`) - runs per asset
+- `--output-dir` (default: `./output/{timestamp}`)
+- `--anthropic-api-key` (from env `ANTHROPIC_API_KEY`) - for Claude-based scoring
+- `--access-token` (default: `ivu-test-token` for DISABLE_AUTH=true)
+- `--delay-between-runs` (default: `5` seconds)
+- `--baseline` - path to previous `results.json` for drift comparison
+- `--skip-scoring` / `--skip-pdf` flags
+
+### Step 3: CSV Parsing & Utilities (`testing/utils.py`)
+CSV columns: `campaign_name`, `brand_guidelines`, `client_lead`, `agency`, `agency_lead`, `proof_name`, `channel`, `sub_channel`, `proof_type`, `proof_file_name`
+
+Validates all required columns exist and all referenced files exist in assets directory before starting.
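Review note on Step 3: the up-front validation could look roughly like the sketch below. The column names come from the plan; the function name `load_and_validate_assets` and its error messages are hypothetical, and the real `utils.py` may be structured differently.

```python
import csv
from pathlib import Path

# Column names are taken from the plan; everything else here is illustrative.
REQUIRED_COLUMNS = [
    "campaign_name", "brand_guidelines", "client_lead", "agency", "agency_lead",
    "proof_name", "channel", "sub_channel", "proof_type", "proof_file_name",
]


def load_and_validate_assets(csv_path, assets_dir):
    """Return the CSV rows as dicts, or raise ValueError describing the problem."""
    assets_dir = Path(assets_dir)
    # utf-8-sig tolerates the byte-order mark that spreadsheet exports often add.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing_columns = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing_columns:
            raise ValueError(f"CSV is missing required columns: {missing_columns}")
        rows = list(reader)

    # Collect every missing file so the user can fix them all in one pass.
    missing_files = []
    for row in rows:
        name = (row["proof_file_name"] or "").strip()
        if not name or not (assets_dir / name).is_file():
            missing_files.append(name)
    if missing_files:
        raise ValueError(f"Files not found in {assets_dir}: {missing_files}")
    return rows
```

Failing before the first WebSocket call means a typo in the CSV is caught immediately rather than partway through a long batch.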
+
+### Step 4: WebSocket Client (`testing/ws_client.py`)
+Async client using `websockets` library that:
+- Reads file, base64-encodes it, detects MIME type
+- Sends `{"type": "analyze", ...}` message (intentionally omits `campaign_id` and `proof_name` to avoid persisting test runs to the production database - see `handlers.py:211`)
+- Listens for `agent_started`, `agent_completed`, `model_fallback`, `complete`, `error` messages
+- Returns structured result with timing
+- 30s connection timeout, 300s analysis timeout, 3 retries with exponential backoff on connection failures
+
+### Step 5: AI-Based Scoring (`testing/scoring/ai_scorer.py`)
+Uses **Claude (Anthropic API)** as an independent judge to evaluate each run's output. This is a stronger IVU approach than using Gemini to score itself - a completely separate AI provider evaluates the analysis quality.
+- Sends the proof image + agent review text to Claude with a detailed rubric
+- Uses `anthropic` Python SDK with Claude Sonnet for cost-effective scoring
+- Returns structured JSON scores (1-10) for:
+  - **Quality**: Clarity, coherence, professional tone, no hallucinations, grammatical correctness
+  - **Accuracy**: Information matches visible proof content, correct facts/numbers, appropriate RAG status
+  - **Completeness**: All agent perspectives covered, legal/brand/channel aspects addressed, no omissions
+- Requires `ANTHROPIC_API_KEY` env var (separate from the Gemini key used by the backend)
+
+### Step 6: Consistency Scoring (`testing/scoring/consistency_scorer.py`)
+Programmatic comparison across iterations (no AI needed):
+- **RAG Status Consistency (25%)**: Unanimous RAG status across runs per agent
+- **Overall Status Consistency (25%)**: Same overallStatus across all runs
+- **Issue Overlap (25%)**: Fuzzy Jaccard similarity of issue lists using `difflib.SequenceMatcher`
+- **Feedback Similarity (25%)**: Normalized text comparison of feedback across runs
+
+### Step 7: Per-Run PDF Reports (`testing/reporting/pdf_report.py`)
+Uses `reportlab` to generate PDFs matching the frontend's `PDFReport.tsx` layout:
+- Cover page with Oliver brand, campaign info, iteration number
+- Overall summary with RAG status and lead agent summary
+- 4 agent review sections with RAG badges, feedback text, and key actions lists
+- Brand colors: Navy `#001f5a`, Blue `#00a3e0`, RAG colors from PDFReport.tsx
+
+### Step 8: Comparative Report (`testing/reporting/comparative_report.py`)
+Generates both HTML and PDF:
+- **Executive Summary**: Run metadata, aggregate pass/fail rates, average scores
+- **Score Dashboard Table**: One row per asset with avg Quality, Accuracy, Completeness, Consistency, Response Time
+- **Per-Asset Detail**: Score breakdown by iteration, RAG distribution, response time stats
+- **Model Drift Analysis**: If baseline provided, delta per asset per metric, flagged regressions (>1 point drop)
+
+### Step 9: Main Runner (`testing/ivu_runner.py`)
+Orchestrates the full flow:
+1. Parse CSV → validate files exist
+2. For each asset (20):
+   - For each iteration (5):
+     - Run WebSocket analysis → collect result + timing
+     - Generate per-run PDF
+     - Score with AI scorer (Quality, Accuracy, Completeness)
+     - Delay between runs
+   - Compute consistency score across all iterations
+3. Save raw `results.json` (for future baseline comparisons)
+4. Generate comparative report (HTML + PDF)
+5. Handle Ctrl+C gracefully (save partial results)
+
+## Output Structure
+```
+output/run_2026-03-03_14-30-00/
+├── results.json               # Machine-readable results (future baseline)
+├── comparative_report.html    # Browser-viewable report
+├── comparative_report.pdf     # PDF comparative report
+├── per_run_pdfs/
+│   ├── Asset01_ProofName_run1.pdf
+│   ├── Asset01_ProofName_run2.pdf
+│   └── ...                    # 20 assets × 5 runs = 100 PDFs
+└── logs/
+    └── ivu_runner.log         # Execution log
+```
+
+## Dependencies (`testing/requirements.txt`)
+```
+websockets>=12.0
+anthropic>=0.40.0
+pydantic>=2.5.0
+reportlab>=4.0
+```
+
+## Key Files Referenced
+- `backend/app/websocket/handlers.py` - WebSocket protocol (lines 42-58 for file handling, 211 for persistence gate)
+- `backend/app/models/schemas.py` - Canonical SubReview/AgentReview schemas to mirror
+- `frontend/components/PDFReport.tsx` - PDF layout, brand colors, and structure to replicate
+- `frontend/services/geminiService.ts` - WebSocket client reference implementation
+- `backend/app/services/gemini_service.py` - Gemini API client pattern (backend reference only)
+
+## Verification
+1. Place test assets in `testing/assets/`, create a CSV with 2-3 assets
+2. Start backend with `DISABLE_AUTH=true`
+3. Run: `cd testing && python ivu_runner.py --csv sample_assets.csv --iterations 2`
+4. Verify per-run PDFs are generated in output directory
+5. Verify `results.json` contains all scores and timing data
+6. Verify comparative report HTML opens in browser with correct data
+7. Run again with `--baseline output/{previous}/results.json` to verify drift detection
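Review note on the drift check in Step 8 and Verification step 7: the "more than 1 point drop" rule could be sketched as follows. The plan does not define the `results.json` schema, so this assumes both files have already been reduced to a hypothetical mapping of asset name to metric name to average score on the 1-10 scale. Response time is excluded here because a higher value is worse and it is not on the same scale.

```python
REGRESSION_THRESHOLD = 1.0  # from the plan: flag drops of more than 1 point


def find_regressions(baseline, current, threshold=REGRESSION_THRESHOLD):
    """Return (asset, metric, delta) tuples where the score dropped by more than threshold."""
    regressions = []
    for asset, current_scores in current.items():
        baseline_scores = baseline.get(asset, {})
        for metric, value in current_scores.items():
            if metric not in baseline_scores:
                continue  # new asset or metric: nothing to compare against
            delta = value - baseline_scores[metric]
            if delta < -threshold:
                regressions.append((asset, metric, round(delta, 2)))
    return regressions


# Illustrative values only, not real results.
baseline = {"Asset01": {"quality": 8.0, "accuracy": 9.0}}
current = {"Asset01": {"quality": 6.5, "accuracy": 8.5}, "Asset02": {"quality": 3.0}}
flagged = find_regressions(baseline, current)
```

With these illustrative inputs only the quality drop on `Asset01` is flagged: the accuracy drop is within the threshold, and `Asset02` has no baseline to compare against.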