First real-data test against the AXA car-insurance PDFs surfaced a noise problem: the new document is a brand refresh — every page flips font (PublicoBanner-Bold→PublicoHeadline-Bold) and colour (#893f4a→#2e3092). At medium-per-finding that crashed the diff score to 0.0 and drowned the bold-regression signal AXA actually flagged. Drop font, size, colour comparators. Keep bold + italic — the attributes the vision-LLM consistently misses on dense layouts. The LLM already narrates colour-scheme rebrands and font swaps in its Modified / Style-changes blocks; running both layers on the same visual change just double-counts it. Tests inverted from "X change is flagged" to "X change is NOT flagged" to lock the scope decision in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AXA Client Documentation
Referenced from main CLAUDE.md. Detailed AXA QC profile descriptions, document-mode pipeline notes, and status.
Overview
AXA QC is built around document-mode — multi-page PDF analysis (policy documents, forms, brochures), not single-asset image checks. The document-mode subsystem (backend/document_mode/) was built for AXA and is now reused by Boots Production Pack.
Status (2026-05-10): Phases 1, 3, 4, 5, 6 merged to develop and live on dev (https://optical-dev.oliver.solutions/ai_qc/). Phase 6 wires veraPDF into the accessibility check (PAC-equivalent PDF/UA-1 validation) and splits accessibility into its own dedicated profile. Email to AXA pending — explains Adobe vs PAC + veraPDF parity findings + requests the original axa-transaction-charges-100326.pdf so we can run a true apples-to-apples comparison. Not yet on prod — held for AXA show-and-tell + email response. Full plan in backend/AXA_DOCUMENT_MODE_PLAN.md.
AXA Profiles
axa_policy_document — single-document mode (7 checks)
Multi-page policy document QC. mode: document, scopes vary per check. Accessibility validation lives in the dedicated axa_accessibility profile, not here.
| Check | What it does | Weight |
|---|---|---|
| axa_font_inventory | Per-page font extraction + brand-font compliance against AXA's approved font list | 1.0 |
| axa_phone_inventory | Extracts phone numbers across pages, validates format and approved-list membership | 1.0 |
| axa_bold_words_definitions | Bold-word inventory + definition cross-check (seed list at backend/document_mode/data/axa_bold_words_seed.json) | 2.0 |
| axa_page_numbering | Page numbering format and continuity | 1.0 |
| axa_print_preflight | Print-preflight checks (color space, embedded fonts, image resolution) | 1.0 |
| axa_print_code | Print code presence + format | 1.0 |
| axa_omg_versioning | OMG version footer/header presence and consistency | 1.0 |
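The document does not state how the Weight column feeds into a profile score, so the following is only a minimal sketch under the assumption that per-check scores in the range 0.0 to 1.0 are combined as a weighted mean. The function name `weighted_profile_score` and the score range are hypothetical; only the check names and weights come from the table above.

```python
from typing import Dict

# Weights copied from the axa_policy_document table above.
AXA_POLICY_WEIGHTS: Dict[str, float] = {
    "axa_font_inventory": 1.0,
    "axa_phone_inventory": 1.0,
    "axa_bold_words_definitions": 2.0,
    "axa_page_numbering": 1.0,
    "axa_print_preflight": 1.0,
    "axa_print_code": 1.0,
    "axa_omg_versioning": 1.0,
}


def weighted_profile_score(
    check_scores: Dict[str, float],
    weights: Dict[str, float] = AXA_POLICY_WEIGHTS,
) -> float:
    """Weighted mean of per-check scores (ASSUMED aggregation, not confirmed).

    Only checks present in `check_scores` contribute, so a skipped check
    does not drag the result down.
    """
    total_weight = sum(weights[name] for name in check_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(score * weights[name] for name, score in check_scores.items())
    return weighted_sum / total_weight
```

Under this assumption, a failure in axa_bold_words_definitions costs twice as much as a failure in any other single check, which is consistent with it being the only check weighted 2.0.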
axa_accessibility — accessibility-only mode (1 check, strict-grade)
mode: document, strict_grade: true. Standalone PDF/UA-1 validation for users who only need to check accessibility compliance without the full policy-document content suite. Mirrors how axes4 PAC is used — single-purpose, binary verdict.
| Check | What it does | Weight |
|---|---|---|
| axa_pdf_accessibility | PDF/UA-1 validation via veraPDF (matches axes4 PAC), with deterministic PyMuPDF fallback if veraPDF is not installed | 1.0 |
axa_policy_document_diff — old-vs-new diff mode (1 check)
mode: document_diff — compares two PDFs (old vs new policy version) and reports structured changes.
| Check | What it does | Weight |
|---|---|---|
| axa_pdf_diff | Detects added/removed/modified pages, paragraphs, defined terms, phone numbers | 1.0 |
Document-mode infrastructure
AXA's document-mode subsystem is the foundation for all multi-page PDF QC in this app:
- document_mode/ingest.py: PDF ingestion, page rendering, span/font/color extraction via PyMuPDF
- document_mode/dispatcher.py: Orchestrates per-check execution against pages; supports scopes document / targeted / page_sample / page_pair / page_each
- document_mode/checks.py, print_preflight_checks.py, accessibility_checks.py: AXA check implementations
- document_mode/diff_engine.py, diff_report_writer.py: Old-vs-new diff handling
- document_mode/result_writer.py: HTML report rendering with per-page sections
Boots Production Pack reuses this entire spine — so any infra changes here affect both clients.
AI usage across AXA tools
For client-facing context: 8 of 9 AXA tools are deterministic (no LLM, $0 cost, run in seconds). Only axa_pdf_diff uses AI: a Gemini 2.5 Pro vision-LLM page-pair comparison at ~$0.40-0.80 per pair, supplemented by a deterministic PyMuPDF span comparator that catches the bold/italic flips the vision-LLM misses. Font, size, and colour changes are left to the LLM narrative diff, because flagging them deterministically drowns out the bold/italic regressions on re-branded documents. The accessibility check uses veraPDF, a rule-based open-source PDF/UA-1 validator, not AI. This framing matters when clients conflate "automation" with "AI".
| Tool | Type | Engine |
|---|---|---|
| axa_font_inventory, axa_phone_inventory, axa_bold_words_definitions, axa_page_numbering, axa_print_code, axa_omg_versioning | Deterministic | PyMuPDF (text + font extraction, regex) |
| axa_print_preflight | Deterministic | PyMuPDF (page geometry, image colour spaces, DPI, transparency, PDF/X) |
| axa_pdf_accessibility | Deterministic (rule-based) | veraPDF subprocess (PDF/UA-1 / Matterhorn Protocol) + PyMuPDF fallback |
| axa_pdf_diff | AI + deterministic | Gemini 2.5 Pro vision-LLM (content + font/size/colour narrative) + PyMuPDF span comparator (bold/italic flip detection) |
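To illustrate the scope of the deterministic span comparator, here is a minimal sketch that flags bold/italic flips and deliberately ignores font name, size, and colour. It assumes spans shaped like the dicts PyMuPDF returns from `page.get_text("dict")` (with `text` and `flags` keys) and pairs old and new spans by exact text, which is a simplification. The function names are hypothetical, and the real logic in document_mode/diff_engine.py may differ.

```python
from typing import Dict, List

# Bit positions of the PyMuPDF span "flags" field.
ITALIC = 1 << 1  # 2
BOLD = 1 << 4    # 16


def style_of(span: Dict) -> Dict[str, bool]:
    """Reduce a span to the two attributes the comparator cares about."""
    flags = span.get("flags", 0)
    return {"bold": bool(flags & BOLD), "italic": bool(flags & ITALIC)}


def find_style_flips(old_spans: List[Dict], new_spans: List[Dict]) -> List[Dict]:
    """Report text whose bold or italic state differs between two versions.

    Font name, size, and colour are never inspected, so a brand refresh
    that only swaps typeface or palette produces no findings.
    Matching by exact text is an ASSUMPTION made to keep the sketch short.
    """
    old_by_text: Dict[str, Dict[str, bool]] = {}
    for span in old_spans:
        text = span["text"].strip()
        if text:
            old_by_text.setdefault(text, style_of(span))

    findings: List[Dict] = []
    for span in new_spans:
        text = span["text"].strip()
        if text not in old_by_text:
            continue  # added text is a content change, not a style flip
        before, after = old_by_text[text], style_of(span)
        for attr in ("bold", "italic"):
            if before[attr] != after[attr]:
                findings.append(
                    {"text": text, "attribute": attr, "old": before[attr], "new": after[attr]}
                )
    return findings
```

With this shape, a defined term that loses its bold styling is reported even when every span on the page has also changed font and colour, and the rebrand itself adds no findings.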
Open items
- AXA show-and-tell pending — feedback will drive the next round of tuning
- Awaiting axa-transaction-charges-100326.pdf from AXA (the file PAC was run against), needed to fully confirm veraPDF↔PAC parity on the Structure Elements rule bucket
- Phase 2 (any further check expansion) deferred until after show-and-tell
- Canonical AXA font list / approved phone list / OMG version reference data may need expansion as test PDFs surface gaps
- Prod deployment of veraPDF + axa_accessibility profile: held until AXA confirms findings on dev
veraPDF deployment
axa_pdf_accessibility runs the veraPDF PDF/UA-1 validator as a subprocess when the binary is available. veraPDF implements the Matterhorn Protocol — the same rule set axes4 PAC uses — so its verdict is the closest open-source equivalent to PAC.
Binary resolution order (in accessibility_checks._resolve_verapdf_binary):
1. VERAPDF_BIN env var
2. verapdf on PATH
3. /opt/ai_qc/vendor/verapdf/verapdf (project-local production install)
If veraPDF isn't installed the check falls back to the 9-criterion deterministic PyMuPDF layer — no breakage, just less depth. Production install pattern is a project-local bundled-JRE tarball under /opt/ai_qc/vendor/verapdf/ to avoid touching system Java or other projects on shared servers.
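The resolution order and fallback described above can be sketched roughly as follows. This is not the actual body of `accessibility_checks._resolve_verapdf_binary`. The name `resolve_verapdf_binary_sketch`, its parameters, and the executable-file check are assumptions made for illustration; only the three lookup locations come from this document.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional

# Project-local production install path named in this document.
VENDOR_VERAPDF = "/opt/ai_qc/vendor/verapdf/verapdf"


def _is_executable_file(path: str) -> bool:
    return Path(path).is_file() and os.access(path, os.X_OK)


def resolve_verapdf_binary_sketch(
    env: Optional[Mapping[str, str]] = None,
    vendor_path: str = VENDOR_VERAPDF,
) -> Optional[str]:
    """Return the first usable veraPDF executable, or None if none is found."""
    env = os.environ if env is None else env

    # 1. Explicit override via the VERAPDF_BIN environment variable.
    override = env.get("VERAPDF_BIN")
    if override and _is_executable_file(override):
        return override

    # 2. A `verapdf` executable on PATH.
    on_path = shutil.which("verapdf", path=env.get("PATH"))
    if on_path:
        return on_path

    # 3. Project-local bundled install.
    if _is_executable_file(vendor_path):
        return vendor_path

    # None signals the caller to use the deterministic PyMuPDF fallback.
    return None
```

Returning `None` instead of raising keeps the missing-binary case on the normal code path, which matches the "no breakage, just less depth" behaviour described above.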
Key files
- backend/AXA_DOCUMENT_MODE_PLAN.md: full design plan and phase breakdown
- backend/document_mode/: pipeline implementation
- backend/profiles/axa_policy_document.json, axa_accessibility.json, axa_policy_document_diff.json
- backend/document_mode/data/axa_bold_words_seed.json: bold-word seed list