# AXA Document-Mode QC — Build Plan

Multi-page PDF QC pipeline for AXA Ireland. Different from every other client onboard because the QC *target* is an 80-page policy PDF, not a single image/video.

Sources of requirements:
- `axa_ireland/AXA build guide new tools required.txt` — original ask + first scoping
- `axa_ireland/Exampole Folders explained.rtf` — what the example folders represent
- `axa_ireland/Example 1/` — old vs new (post brand-refresh) Home Insurance policy + the human QC checklist (`Policy documents QC CHECKS.docx`)
- `axa_ireland/Example 2/` — Landlord Insurance v1 (shipped with errors), V10 (corrected), V1 2025 amends, original master. **Phase-3 diff test pair: V1 vs V10.**

## Sequencing (updated 2026-05-01)

Inverted from "discover everything → build" to "build minimum demo → show client → drive deeper discovery from their reactions".

1. **Phase 1 refactor** ✅ DONE on `feature/axa-document-mode` 2026-05-01 — scope-aware dispatcher, 6 deterministic checks, $0 LLM cost, runs in seconds. Replaces broken Phase-1 stub.
2. **Phase 3 build** (next) — old-vs-new diff using vision-LLM page-pair, no LlamaParse account needed.
3. **Show-and-tell with AXA** — demonstrate Phases 1 + 3 working, show costs, gather requirements (font list, bold-words list, phone numbers, WCAG target, print preflight scope).
4. **Phases 2, 4, 5** in order, with local testing between each.
5. Local-only until show-and-tell; do not push `feature/axa-document-mode` to dev or merge to develop yet.

## Phase 0 discovery — answered 2026-05-01 (provisional, pending client confirmation)

| Question | Decision |
|---|---|
| Approved Monotype font list | **Not yet supplied.** Until then, `axa_font_inventory` lists fonts only (informational). When list arrives, becomes `axa_font_compliance`. |
| Bold-words dictionary | **Bootstrap from Example 1 General Definitions** (pages 8–10) until AXA supplies canonical list. 35 terms extracted, saved to `backend/document_mode/data/axa_bold_words_seed.json`. Some short terms (`you`, `your`, `we`) produce false positives — accept until canonical list lands. |
| Approved phone number list | **Not yet supplied.** `axa_phone_inventory` lists numbers only. Becomes `axa_phone_compliance` when list arrives. |
| WCAG target | **AAA** (Phase 4 veraPDF profile setting) |
| Print preflight scope | **"Is it print-ready?"** simple version. Expand later if needed. |
| Page sampling defaults | **N=8** for visual sanity, **N=5** for print sanity |
| Volume expectation | Still pending — drives later cost decisions |

## Architectural choice

**One web UI, isolated backend module.** Doc-mode shares the existing shell (auth, client picker, Settings, Reporting, Admin, user-access, output history) and runs as a third mode alongside static + video. New code lives in `backend/document_mode/` with new endpoints under `/api/document/*`, gated by `mode: "document"` on the profile JSON. Existing single-asset clients are not touched.

Rejected: a separate `/ai_qc/documents/` page. Would re-implement the shell for one client and fork future doc-mode work per-client.

## Scope-aware dispatcher (Phase 1 refactor, 2026-05-01)

Each check declares its `scope` in the profile JSON. The dispatcher routes by scope:

| Scope | Behaviour | Cost shape | Phase 1 uses? |
|---|---|---|---|
| `document` | Run once over the full ingest result. Deterministic checks live in `document_mode/checks.py`. | $0, milliseconds | ✅ all 4 deterministic doc-checks |
| `targeted` | Run once on specific pages (`scope_args.pages`: `first`, `last`, `first-N`, `last-N`, or list). | $0, milliseconds | ✅ print_code, omg_versioning |
| `page_sample` | Run on N evenly-spaced pages via existing batch dispatcher. | N × LLM call | Phase 4/5 |
| `page_pair` | Run on aligned old/new page pairs. | M × LLM call | Phase 3 |
| `page_each` | Run on every page (legacy / very expensive). | N × LLM call per check | Avoid |

Profile JSON shape:

```json
"checks": {
  "axa_font_inventory": {"weight": 1.0, "scope": "document", "enabled": true},
  "axa_print_code": {"weight": 1.0, "scope": "targeted", "scope_args": {"pages": "last"}, "enabled": true}
}
```

The `scope` field is optional in the `QCCheckConfig` dataclass — defaults to `None`, which the doc-mode dispatcher treats as `page_each` for legacy compatibility. Asset-mode pipeline ignores scope entirely.

## Phased delivery

Each phase ships as its own tag so we can demo / rollback in slices.

### Phase 0 — Discovery (you, before further code)

Need from AXA before building Phase 2+:
- [ ] Approved **Monotype font list** (family + weights). Brand refresh moved them off old BOX-licensed fonts; we need the canonical list for `axa_font_compliance`.
- [ ] **Bold-words dictionary** for General Definitions (Example 2 says 70+).
- [ ] 2–3 more old/new PDF pairs beyond Examples 1 & 2 (ideally Motor Insurance for diversity).
- [ ] WCAG target — AA or AAA. EAA scope confirmation.
- [ ] Print preflight scope — "is it print-ready?" or full PDF/X-1a/4 compliance.
- [ ] Volume expectation — 80 pages × how many docs/month.

### Phase 1 — Document-mode plumbing + deterministic checks — REFACTORED 2026-05-01 ✅

**Original Phase 1 (2026-04-29):** spine only, ran existing image-based `accessibility` per page on all 86 pages. Smoke test ran in ~70min for ~$0.50. Output report revealed every "failing page" failed for the same false-positive reason: *"document is presented as an image of text / WCAG 1.4.5"* — the LLM was critiquing our PNG rendering pipeline, not AXA's actual PDF. Result: noisy 67.7/Pass with no actionable findings.

**Refactor (2026-05-01) on `feature/axa-document-mode`:** scope-aware dispatcher + 6 deterministic doc-scope checks.
Same Home Insurance PDF now scores 81.4/Pass in seconds at $0 cost, with real findings: font inventory, phone-number inventory, 132 non-bold defined-term occurrences flagged across 53 pages, 5 page-numbering discontinuities, print code "AG400 11/25" detected.

Files added:
- `backend/document_mode/__init__.py`
- `backend/document_mode/ingest.py` — multi-page PDF → per-page PNGs + per-span structured text (font, size, bold flag, italic flag, bbox). Uses PyMuPDF. Bold detection = `flags & 16` OR font name contains `bold|black|heavy`. Default zoom 2.0×, max dim 1600 px, page cap 200.
- `backend/document_mode/dispatcher.py` — **scope-aware** routing. `document` and `targeted` checks bypass LLM; `page_sample` / `page_each` use existing batch dispatcher; `page_pair` reserved for Phase 3.
- `backend/document_mode/checks.py` — registry of 6 deterministic doc-scope checks (see table below). Each returns `{check_name, scope, score, pass, summary, findings, response}`.
- `backend/document_mode/data/axa_bold_words_seed.json` — bootstrap dictionary, 35 terms extracted from Example 1 General Definitions (pages 8-10).
- `backend/document_mode/result_writer.py` — writes JSON + self-contained HTML with: at-a-glance findings table, per-check sections with structured renderers (font/phone/bold-words/page-numbering/print-code/OMG each get their own table), per-page summary strip. Reports collapsed by default.
- `backend/profiles/axa_policy_document.json` — production profile with `mode: "document"`, 6 deterministic checks, `visibility: client_specific, visible_to_clients: ["axa"]`.
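
The bold-detection rule described for `ingest.py` (`flags & 16` OR a font name matching `bold|black|heavy`) can be sketched as a small pure function. This is an illustration rather than the shipped code: `is_bold_span` is a hypothetical name, the font names in the example are placeholders, and the input is assumed to be a span dictionary shaped like the ones PyMuPDF returns from `page.get_text("dict")`, with `flags` and `font` keys.

```python
import re

# Value 16 in a PyMuPDF span's "flags" field marks bold text.
BOLD_FLAG = 16

# Assumption: a case-insensitive substring match on the font name,
# mirroring the "bold|black|heavy" rule described above.
BOLD_NAME_RE = re.compile(r"bold|black|heavy", re.IGNORECASE)


def is_bold_span(span: dict) -> bool:
    """Return True if a span looks bold by flag bit or by font name."""
    flags = int(span.get("flags", 0))
    font = str(span.get("font", ""))
    return bool(flags & BOLD_FLAG) or bool(BOLD_NAME_RE.search(font))


if __name__ == "__main__":
    # Placeholder spans; real spans carry more keys (size, bbox, text, ...).
    for span in (
        {"font": "Helvetica", "flags": 4},        # serif bit only
        {"font": "Helvetica", "flags": 20},       # bold bit set
        {"font": "Helvetica-Heavy", "flags": 0},  # bold by name only
    ):
        print(span["font"], span["flags"], is_bold_span(span))
```

The name-based fallback matters because some PDFs embed a heavy weight as a separate font without setting the bold flag; the trade-off is that any font whose name merely contains one of those words is also treated as bold.
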
### The 6 Phase-1 deterministic checks

| Check | Scope | What it does | Becomes (when client supplies data) |
|---|---|---|---|
| `axa_font_inventory` | document | Lists every unique font + per-page distribution | `axa_font_compliance` (flags non-approved) |
| `axa_phone_inventory` | document | Regex-extracts every phone-shaped number, dedup, with page refs | `axa_phone_compliance` (flags non-approved) |
| `axa_bold_words_definitions` | document | Scans for seed-dictionary terms, flags non-bold occurrences | Same — replace seed dict with AXA's canonical list |
| `axa_page_numbering` | document | Detects standalone-line integers near top/bottom, flags discontinuities | Same |
| `axa_print_code` | targeted: last | Finds back-page print/version line components (code + ref + date + version) | Same — refine regex once AXA confirms format |
| `axa_omg_versioning` | targeted: last | Finds OMG code + date format on back page | Same |

Files modified:
- `backend/profile_config.py` — `Profile.mode` field defaults to `"asset"`. `QCCheckConfig` gains `scope` and `scope_args` fields, both optional. Persisted only when non-default.
- `backend/api_server.py` — `POST /api/document/start_analysis` endpoint. The `enabled_checks` filter accepts checks from the document-mode registry (`is_document_scope_check`) in addition to the legacy `qc_apps` registry, so deterministic AXA checks aren't filtered out.
- `backend/client_config.py` — AXA client gains `axa_policy_document` as first profile.
- `web_ui.html` — doc-mode banner under upload area, file-input `accept` swapped to PDF-only, `performAnalysisWithProgress` routes to `/api/document/start_analysis` with `client_id`.

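
The `targeted` checks in the table rely on `scope_args.pages` values (`first`, `last`, `first-N`, `last-N`, or a list) being turned into concrete pages before a check such as `axa_print_code` runs. A minimal sketch of that resolution follows. It assumes explicit lists hold 1-based page numbers and that the dispatcher works with 0-based indices; `resolve_target_pages` is a hypothetical helper, not a function taken from `dispatcher.py`.

```python
from typing import List, Union


def resolve_target_pages(pages: Union[str, List[int]], page_count: int) -> List[int]:
    """Turn a scope_args.pages value into sorted 0-based page indices."""
    if page_count <= 0:
        return []
    if isinstance(pages, list):
        # Assumption: lists are 1-based page numbers; out-of-range ones are dropped.
        return sorted({p - 1 for p in pages if 1 <= p <= page_count})
    if pages == "first":
        return [0]
    if pages == "last":
        return [page_count - 1]
    kind, sep, count = pages.partition("-")
    if sep and kind in ("first", "last") and count.isdigit():
        # Clamp N so "first-5" on a 3-page document returns all 3 pages.
        n = min(int(count), page_count)
        if kind == "first":
            return list(range(n))
        return list(range(page_count - n, page_count))
    raise ValueError(f"Unsupported pages selector: {pages!r}")


if __name__ == "__main__":
    print(resolve_target_pages("last", 86))     # [85]
    print(resolve_target_pages("first-3", 86))  # [0, 1, 2]
    print(resolve_target_pages("last-2", 86))   # [84, 85]
```

Raising on an unknown selector, rather than silently falling back to every page, keeps a typo in a profile from turning a cheap targeted check into a full-document run.
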
**Smoke-tested 2026-05-01 (post-refactor) against same Home Insurance PDF:**
- Score: 81.4 / 100 (Pass)
- Total runtime: a few seconds (deterministic only, no LLM calls)
- Total cost: $0
- Findings: 10 fonts, 8 phone numbers, 132 non-bold defined-term occurrences across 53 pages, 5 page-number discontinuities, print code "AG400" + "11/25" detected on back page, no OMG present.
- Smoke-test report: `backend/output-dev/axa/PHASE1_REFACTOR_smoke_test_report.html`

**Local test plan (after `./scripts/run-local.sh`):**
1. Pick AXA client → AXA Policy Document profile
2. Upload `axa_ireland/Example 1/6317047 - AXA - Home Insurance Policy 2025 V8 final new brand.pdf`
3. Verify: doc-mode banner, PDF-only picker, progress completes in seconds, report appears with at-a-glance table + per-check sections + structured findings.

**Known gotchas to surface during demo:**
- Bold-words bootstrap dictionary contains short terms (`you`, `your`, `we`, `us`) which produce false positives in normal pronoun usage. Mitigated by Phase-2 work (canonical list from AXA).
- Page-numbering heuristic catches TOC-page numbers as false-positive "page numbers" (5 such hits in this doc). Surface as data, score gently.
- Print-code regex tuned to "AG400 11/25" pattern observed in Example 1; may need tuning for other docs.

### Phase 2 — Deterministic checks (~3–5 days)

The cheap, accurate wins. No LLM cost.

- `backend/document_mode/font_compliance.py` — reads PDF font inventory, flags anything not on AXA's approved list. Per-page failure log. Plugs into `process_single_check` via the same early-branch pattern as `dj_file_naming` (line ~384 in `api_server.py`).
- `backend/document_mode/bold_words.py` — scans pages for AXA's bold-words dictionary, flags any occurrence not rendered bold.
- `backend/document_mode/print_code.py` — extracts back-page print code, optionally compares to brief-supplied value.
- `backend/document_mode/omg_versioning.py` — confirms back-page OMG number + date format compliance (regex pattern, similar to `file_naming_validator`).
- New Settings → "AXA Configuration" tab for uploading approved fonts list + bold-words list per client (same pattern as media plan upload).

### Phase 3 — Old-vs-new diff — DONE on `feature/axa-document-mode` 2026-05-01 ✅

Vision-LLM-based page-pair diff. Validates the original Example-2 promise: catches the bold-formatting fixes, structural changes, definition updates, and content additions/removals that V1 missed and V10 fixed.

**Files added:**
- `backend/document_mode/diff_engine.py` — page alignment via difflib SequenceMatcher (windowed fuzzy match, threshold 0.4) + parallel page-pair vision-LLM diff via Gemini 2.5 Pro (8 concurrent). Returns alignment map + structured diff JSON per pair (added/removed/modified/moved/style_changes/severity).
- `backend/document_mode/diff_report_writer.py` — diff-specific HTML/JSON. Versions card, at-a-glance grid (page count delta, severity counts), full alignment table, per-pair cards with severity pills + categorised diff blocks.
- `backend/profiles/axa_policy_document_diff.json` — `mode: document_diff` profile.

**Files modified:**
- `backend/api_server.py` — new `POST /api/document/start_diff` endpoint accepting `old_file` + `new_file`. Reuses `_require_client_access`, `progress_tracker`, `ensure_client_output_folder`, `usage_tracker`.
- `backend/client_config.py` — AXA profile list gains `axa_policy_document_diff`.
- `web_ui.html` — third `mode: document_diff` UX path. Two-slot drop-zone (old + new). `applyProfileMode()` swaps between asset/document/document_diff. `wireDiffPickers()` wires the dual file pickers. `startAnalysis()` + `performAnalysisWithProgress()` route diff-mode submissions to `/api/document/start_diff`.

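
The page-alignment step in `diff_engine.py` is described as a windowed fuzzy match with `difflib.SequenceMatcher` and a 0.4 threshold. The sketch below shows one way such an alignment could work. The greedy forward search, the window of 5 pages, and the name `align_pages` are assumptions for illustration; the real module may align pages differently.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def align_pages(
    old_pages: List[str],
    new_pages: List[str],
    window: int = 5,
    threshold: float = 0.4,
) -> List[Pair]:
    """Greedily pair old and new pages by text similarity.

    Returns (old_index, new_index) tuples; None marks a page that was
    added (no old index) or removed (no new index).
    """
    pairs: List[Pair] = []
    next_new = 0  # first new page not yet consumed
    for old_idx, old_text in enumerate(old_pages):
        best_idx, best_ratio = None, 0.0
        # Only look a few pages ahead so matches keep document order.
        for new_idx in range(next_new, min(next_new + window, len(new_pages))):
            ratio = SequenceMatcher(None, old_text, new_pages[new_idx]).ratio()
            if ratio >= threshold and (best_idx is None or ratio > best_ratio):
                best_idx, best_ratio = new_idx, ratio
        if best_idx is None:
            pairs.append((old_idx, None))  # no match in window: removed
            continue
        # New pages skipped before the match are treated as additions.
        pairs.extend((None, skipped) for skipped in range(next_new, best_idx))
        pairs.append((old_idx, best_idx))
        next_new = best_idx + 1
    # Any new pages left over were added at the end.
    pairs.extend((None, idx) for idx in range(next_new, len(new_pages)))
    return pairs


if __name__ == "__main__":
    # Synthetic page texts with no characters in common, so ratios are 0 or 1.
    old = ["a" * 10, "b" * 10, "c" * 10]
    new = ["a" * 10, "x" * 10, "b" * 10, "c" * 10]
    print(align_pages(old, new))  # [(0, 0), (None, 1), (1, 2), (2, 3)]
```

Only the matched pairs would go on to the vision-LLM diff; added and removed pages can be reported directly from the alignment map. A greedy pass like this can mis-pair pages when two nearby pages are very similar, which is one reason to show the full alignment table in the report.
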
**Smoke-test 2026-05-01 — V1 (68 pages, broken) vs V10 (74 pages, corrected):**
- Wall: 214 seconds (3:34)
- Tokens: 214,342 (cost ≈ $0.50–$0.70)
- 63 matched pairs · 11 pages added in V10 · 5 pages removed
- 61 pages with differences flagged · 2 unchanged
- Severity: 25 high, 32 medium, 4 low
- Score 0/100, "Major changes" — correct call
- Smoke-test report: `backend/output-dev/axa/phase3_smoke_*_diff_report.html`

**Caught the Example-2-class defects:**
- Bold-formatting changes (e.g. *"the terms 'us', 'we', and 'adviser' are now bolded"*, *"the term 'your' is now bolded"*) — exactly the missed-bold issue that motivated this build.
- New Section F: Legal Protection added in V10 — structural insertion caught.
- New "Period of Insurance" definition added — defined-term addition caught.
- "Employee" definition expanded by a sub-point — definition modification caught.
- Wording fix: *"supply your own expense" → "supply at your own expense"* — body-text correction caught.
- 11 pages flagged as added, 5 as removed — page-count delta and structural restructure caught.

**Cost dial (for show-and-tell):** ~$0.40–0.70 per diff against a typical 70-80-page policy.

**Local test plan (UI):**
1. `./scripts/run-local.sh`
2. AXA → AXA Policy Document — Old vs New Diff
3. Pick V1.pdf as old, V10.pdf as new (both from `axa_ireland/Example 2/`)
4. Click analyse. Wait ~3-5 minutes. Report lands in saved files.

### Phase 4 — PDF accessibility — DONE on `feature/axa-document-mode` 2026-05-01 ✅

**Pure-Python implementation.** Original plan was veraPDF subprocess (Java dependency, ~150MB install). Built deterministic PyMuPDF-based check instead — no Java needed for the demo, with veraPDF as an optional add-on later.

**Files added:**
- `backend/document_mode/accessibility_checks.py` — 9 PDF/UA-aligned criteria checked deterministically:
  - **C1** Tagged PDF (StructTreeRoot present)
  - **C2** Marked content (/MarkInfo /Marked true)
  - **C3** Document title metadata
  - **C4** Document language (/Lang)
  - **C5** No password protection blocking AT
  - **C6** All fonts embedded
  - **C7** PDF version ≥ 1.5
  - **C8** XMP UA-conformance declaration
  - **C9** Alt text on images (sampling)
  - Plus a `_run_verapdf()` stub for future veraPDF integration

**Files modified:**
- `backend/document_mode/checks.py` — `axa_pdf_accessibility` registered.
- `backend/document_mode/ingest.py` — `pdf_path` added to ingest_result so doc-scope checks can read raw PDF structure.
- `backend/document_mode/result_writer.py` — `_render_pdf_accessibility` structured renderer (criteria checklist with pass/fail markers).
- `backend/profiles/axa_policy_document.json` — `axa_pdf_accessibility` added at weight 2.0.

**Smoke-test 2026-05-01 against Example 1 Home Insurance V8 (Adobe InDesign output):**
- Overall AXA Policy Document score: 80.6 / Pass (7 checks, $0 cost, runs in seconds)
- Accessibility check: 7.78 / 10 (7 of 9 criteria passed)
- Real gaps caught:
  - **C7 fail:** PDF 1.4 — should be 1.5+ for full accessibility tagging support
  - **C8 fail:** No PDF/UA-1 conformance flag in XMP metadata
- Pass: tagged structure, marked content, title set, /Lang=en, no encryption, all 10 fonts embedded, alt-text entries detected

**veraPDF integration plan (when ready):**
1. Install veraPDF on host: https://verapdf.org/software/ (requires JRE 8+, ~150MB)
2. Ensure `verapdf` binary on PATH or set `VERAPDF_BIN` env var
3. Replace `_run_verapdf()` stub with `subprocess.run([verapdf, '--format', 'json', '--profile', 'ua1', pdf_path], capture_output=True)` and merge JSON findings into `axa_pdf_accessibility`'s output
4. Set `findings['verapdf_run'] = True`

### Phase 5 — Print preflight — DONE on `feature/axa-document-mode` 2026-05-01 ✅

**Pure-Python implementation.** Original plan was Ghostscript-based; built deterministic PyMuPDF checks instead — same approach as Phase 4. Ghostscript can plug in later for total-ink-coverage / registration-black if scope grows.

**Files added:**
- `backend/document_mode/print_preflight_checks.py` — 7 deterministic preflight criteria:
  - **PP1** Page geometry consistency (single MediaBox size across all pages)
  - **PP2** Bleed area defined (TrimBox/BleedBox differ from MediaBox)
  - **PP3** Image colour spaces (flag DeviceRGB; press wants CMYK/Gray)
  - **PP4** Image effective DPI (raw pixels / rendered inches; flag < 150)
  - **PP5** Transparency / soft-mask usage (flag for flattening)
  - **PP6** PDF/X conformance (XMP `pdfxid:GTS_PDFXVersion`)
  - **PP7** Spot colour usage (flag /Separation, /DeviceN)

**Files modified:**
- `backend/document_mode/checks.py` — `axa_print_preflight` registered.
- `backend/document_mode/result_writer.py` — `_render_print_preflight` structured renderer with low-DPI image list, colour-space breakdown, spot-colour list, page-size detail.
- `backend/profiles/axa_policy_document.json` — `axa_print_preflight` added at weight 1.0.

**Smoke-test 2026-05-01 against Example 1 Home Insurance V8:**
- Print preflight: 5.71 / 10 (4 of 7 criteria pass) — correctly flags as digital-intent
  - ✓ PP1 — All 86 pages 210×297 mm (A4), consistent
  - ✗ PP2 — No bleed authored (digital intent — correct finding for an electronic policy)
  - ✓ PP3 — Only 1 grayscale image, no RGB
  - ✓ PP4 — Image renders at 279 DPI (above 150 threshold)
  - ✗ PP5 — 85 of 86 pages use transparency / soft-masks (Adobe InDesign default; would need flattening for press)
  - ✗ PP6 — No PDF/X conformance flag in XMP
  - ✓ PP7 — No spot colour spaces
- Updated full-profile score: 78.25 / Pass (8 checks now)

**Demo conversation:** *"If you're distributing electronically, this PDF is fine.
If you're going to press, you need to (1) author bleed in InDesign, (2) flatten transparency on export, (3) declare PDF/X-1a or PDF/X-4 conformance."*

## All phases status (2026-05-01)

| Phase | Scope | Status | Cost | Wall |
|---|---|---|---|---|
| 1 | Spine + 6 deterministic doc-scope checks | ✅ Done | $0 | seconds |
| 2 | Compliance variants (font/phone/bold lists) | Blocked on AXA | — | — |
| 3 | Old-vs-new diff (vision LLM page-pair) | ✅ Done | ~$0.50/run | ~3-5 min |
| 4 | PDF accessibility (PyMuPDF, veraPDF stub) | ✅ Done | $0 | seconds |
| 5 | Print preflight (PyMuPDF, Ghostscript later) | ✅ Done | $0 | seconds |

**Demo-ready as of 2026-05-01.** All work is on `feature/axa-document-mode`, local-only, no commits or pushes yet.

## Things to flag before any further build

1. **80 pages × multiple LLM checks = serious cost.** A doc with the existing static checks running per-page would be ~$5–10 in Gemini/OpenAI calls. We should decide which LLM checks need per-page vs once over a sampled set. Most should be deterministic-only.
2. **veraPDF is Java.** Adds a JRE dependency to GCP boxes.
3. **PDF mode breaks "one upload = one report" assumption.** Decide what to save: full per-page JSON, summary only, or both. (Phase 1 saves both.)
4. **Reporting/billing.** An 80-page doc is one analysis but 80× the LLM work. We should bill it as one analysis but track total checks separately. `usage_tracker.log_analysis_complete` already gets `pages_processed` in doc mode.
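
For point 1, the `page_sample` scope is what keeps LLM cost proportional to N rather than to page count. A minimal sketch of picking N evenly spaced pages follows. It assumes the first and last pages should always be included; `sample_pages` is a hypothetical helper, and the existing batch dispatcher may choose pages differently.

```python
from typing import List


def sample_pages(page_count: int, n: int) -> List[int]:
    """Return up to n evenly spaced 0-based page indices, including both ends."""
    if page_count <= 0 or n <= 0:
        return []
    if n >= page_count:
        # Short document: sampling would not save anything, so use every page.
        return list(range(page_count))
    if n == 1:
        return [0]
    # With n < page_count the step exceeds 1, so rounded indices stay distinct.
    step = (page_count - 1) / (n - 1)
    return [round(i * step) for i in range(n)]


if __name__ == "__main__":
    # The 86-page Home Insurance PDF with the N=8 visual-sanity default.
    print(sample_pages(86, 8))  # [0, 12, 24, 36, 49, 61, 73, 85]
```

With the N=8 default, an 86-page document costs 8 LLM calls per sampled check instead of 86 under `page_each`.
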
## How doc-mode plugs into existing pipeline

For maintenance — the integration map:

| Doc-mode component | Reuses existing | Where |
|---|---|---|
| Per-page check execution | `process_checks_in_batches()` | `api_server.py:498` |
| Per-check dispatch | `process_single_check()` | `api_server.py:377` |
| LLM call | `run_visual_qc()` | `llm_config.py` |
| Auth + client access | `auth.require_auth`, `_require_client_access()` | `api_server.py:4883` |
| Progress polling | `/api/progress/` | `api_server.py:1695` |
| Output serving | `/output//` | `api_server.py:2121` |
| Output listing | `/api/output_files` | `api_server.py:2168` |
| Output folder | `ensure_client_output_folder()` | `api_server.py:856` |
| Profile loading | `profile_config.get_profile()` | `profile_config.py:219` |
| Profile visibility | `client_config.get_profiles_with_visibility()` | `client_config.py:82` |
| Usage logging | `usage_tracker.log_analysis_start/complete()` | `usage_tracker.py:73,100` |

Future deterministic doc-mode checks should follow `_run_dj_file_naming_check()` (`api_server.py:348`) — short-circuit at the top of `process_single_check` before any LLM dispatch.
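
The short-circuit pattern can be sketched as follows. This is an illustration only: the registry, the `example_page_count` check, and the simplified `process_single_check` signature are hypothetical stand-ins and do not reproduce the real code in `api_server.py`.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry mapping check names to deterministic functions.
DETERMINISTIC_CHECKS: Dict[str, Callable[[dict], dict]] = {}


def deterministic_check(name: str):
    """Register a function as a deterministic (no-LLM) check."""
    def decorator(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        DETERMINISTIC_CHECKS[name] = func
        return func
    return decorator


@deterministic_check("example_page_count")
def example_page_count(ingest_result: dict) -> dict:
    # Placeholder logic: passes when the document has at least one page.
    pages = ingest_result.get("pages", [])
    return {
        "check_name": "example_page_count",
        "score": 10.0 if pages else 0.0,
        "pass": bool(pages),
        "summary": f"{len(pages)} page(s) ingested",
    }


def process_single_check(
    check_name: str,
    ingest_result: dict,
    llm_call: Optional[Callable[[str, dict], dict]] = None,
) -> dict:
    # Early branch: deterministic checks return before any LLM dispatch.
    handler = DETERMINISTIC_CHECKS.get(check_name)
    if handler is not None:
        return handler(ingest_result)
    if llm_call is None:
        raise ValueError(f"No handler available for check {check_name!r}")
    return llm_call(check_name, ingest_result)
```

Keeping the deterministic branch at the very top means a registered check can never incur LLM cost by accident, even if it is also listed alongside LLM-backed checks in a profile.
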