ai_qc/backend/AXA_DOCUMENT_MODE_PLAN.md

# AXA Document-Mode QC — Build Plan

Multi-page PDF QC pipeline for AXA Ireland. It differs from every other onboarded client because the QC *target* is an 80-page policy PDF, not a single image or video. Sources of requirements:

- `axa_ireland/AXA build guide new tools required.txt` — original ask + first scoping
- `axa_ireland/Exampole Folders explained.rtf` — what the example folders represent
- `axa_ireland/Example 1/` — old vs new (post brand-refresh) Home Insurance policy + the human QC checklist (`Policy documents QC CHECKS.docx`)
- `axa_ireland/Example 2/` — Landlord Insurance v1 (shipped with errors), V10 (corrected), V1 2025 amends, original master. **Phase-3 diff test pair: V1 vs V10.**

## Sequencing (updated 2026-05-01)

Inverted from "discover everything → build" to "build minimum demo → show client → drive deeper discovery from their reactions".

1. **Phase 1 refactor** ✅ DONE on `feature/axa-document-mode` 2026-05-01 — scope-aware dispatcher, 6 deterministic checks, $0 LLM cost, runs in seconds. Replaces broken Phase-1 stub.
2. **Phase 3 build** ✅ DONE on `feature/axa-document-mode` 2026-05-01: old-vs-new diff using vision-LLM page-pair, no LlamaParse account needed.
3. **Show-and-tell with AXA** — demonstrate Phases 1 + 3 working, show costs, gather requirements (font list, bold-words list, phone numbers, WCAG target, print preflight scope).
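4. **Phase 2** once AXA supplies the font, bold-words, and phone lists, with local testing as each list lands. Phases 4 and 5 were built ahead of the show-and-tell (see the status table).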
5. Local-only until show-and-tell; do not push `feature/axa-document-mode` to dev or merge to develop yet.

## Phase 0 discovery — answered 2026-05-01 (provisional, pending client confirmation)

| Question | Decision |
|---|---|
| Approved Monotype font list | **Not yet supplied.** Until then, `axa_font_inventory` lists fonts only (informational). When list arrives, becomes `axa_font_compliance`. |
| Bold-words dictionary | **Bootstrap from Example 1 General Definitions** (pages 8–10) until AXA supplies canonical list. 35 terms extracted, saved to `backend/document_mode/data/axa_bold_words_seed.json`. Some short terms (`you`, `your`, `we`) produce false positives — accept until canonical list lands. |
| Approved phone number list | **Not yet supplied.** `axa_phone_inventory` lists numbers only. Becomes `axa_phone_compliance` when list arrives. |
| WCAG target | **AAA** (Phase 4 veraPDF profile setting) |
| Print preflight scope | **"Is it print-ready?"** simple version. Expand later if needed. |
| Page sampling defaults | **N=8** for visual sanity, **N=5** for print sanity |
| Volume expectation | Still pending — drives later cost decisions |

## Architectural choice

**One web UI, isolated backend module.** Doc-mode shares the existing shell (auth, client picker, Settings, Reporting, Admin, user-access, output history) and runs as a third mode alongside static + video. New code lives in `backend/document_mode/` with new endpoints under `/api/document/*`, gated by `mode: "document"` on the profile JSON. Existing single-asset clients are not touched.

Rejected: a separate `/ai_qc/documents/` page. It would re-implement the shell for one client and fork future doc-mode work per client.

## Scope-aware dispatcher (Phase 1 refactor, 2026-05-01)

Each check declares its `scope` in the profile JSON. The dispatcher routes by scope:

| Scope | Behaviour | Cost shape | Phase 1 uses? |
|---|---|---|---|
| `document` | Run once over the full ingest result. Deterministic checks live in `document_mode/checks.py`. | $0, milliseconds | ✅ all 4 deterministic doc-checks |
| `targeted` | Run once on specific pages (`scope_args.pages`: `first`, `last`, `first-N`, `last-N`, or list). | $0, milliseconds | ✅ print_code, omg_versioning |
| `page_sample` | Run on N evenly-spaced pages via existing batch dispatcher. | N × LLM call | Phase 4/5 |
| `page_pair` | Run on aligned old/new page pairs. | M × LLM call | Phase 3 |
| `page_each` | Run on every page (legacy / very expensive). | N × LLM call per check | Avoid |

Profile JSON shape:
```json
"checks": {
  "axa_font_inventory": {"weight": 1.0, "scope": "document", "enabled": true},
  "axa_print_code": {"weight": 1.0, "scope": "targeted", "scope_args": {"pages": "last"}, "enabled": true}
}
```
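
A minimal sketch of how the `scope_args.pages` values from the scope table (`first`, `last`, `first-N`, `last-N`, or a list) and the `page_sample` spacing could be resolved. The function names, the 1-based page numbering, and the rule that a sample always includes the first and last page are assumptions for illustration, not the code in `dispatcher.py`.

```python
from typing import List, Sequence, Union

PageSelector = Union[str, Sequence[int]]


def resolve_target_pages(selector: PageSelector, page_count: int) -> List[int]:
    """Turn a `scope_args.pages` value into sorted 1-based page numbers."""
    if page_count <= 0:
        return []
    if isinstance(selector, str):
        if selector == "first":
            return [1]
        if selector == "last":
            return [page_count]
        kind, _, count = selector.partition("-")
        if kind in ("first", "last") and count.isdigit() and int(count) > 0:
            n = min(int(count), page_count)
            start = 1 if kind == "first" else page_count - n + 1
            return list(range(start, start + n))
        raise ValueError(f"unsupported page selector: {selector!r}")
    # Explicit list: drop out-of-range entries and duplicates.
    return sorted({p for p in selector if 1 <= p <= page_count})


def sample_pages(n: int, page_count: int) -> List[int]:
    """Pick up to n evenly spaced 1-based pages (assumed to include first and last)."""
    if n <= 0 or page_count <= 0:
        return []
    if n >= page_count:
        return list(range(1, page_count + 1))
    if n == 1:
        return [1]
    step = (page_count - 1) / (n - 1)
    return [1 + round(i * step) for i in range(n)]


print(resolve_target_pages("last-3", 86))  # [84, 85, 86]
print(sample_pages(8, 86))
```

Unknown string selectors raise rather than silently falling back, so a typo in a profile surfaces at dispatch time instead of producing an empty check.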

The `scope` field is optional on the `QCCheckConfig` dataclass and defaults to `None`, which the doc-mode dispatcher treats as `page_each` for legacy compatibility. The asset-mode pipeline ignores `scope` entirely.
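
To illustrate the default described above, here is a hedged sketch of scope resolution and LLM/non-LLM grouping. `CheckConfig` is a hypothetical stand-in for `QCCheckConfig`, and `effective_scope` / `plan_route` are illustrative names; the real dispatcher may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Scopes that run deterministically (no LLM call) per the scope table.
DETERMINISTIC_SCOPES = {"document", "targeted"}
LLM_SCOPES = {"page_sample", "page_pair", "page_each"}


@dataclass
class CheckConfig:  # hypothetical stand-in for QCCheckConfig
    weight: float = 1.0
    enabled: bool = True
    scope: Optional[str] = None
    scope_args: Dict[str, Any] = field(default_factory=dict)


def effective_scope(cfg: CheckConfig) -> str:
    """A missing scope falls back to the legacy per-page behaviour."""
    scope = cfg.scope or "page_each"
    if scope not in DETERMINISTIC_SCOPES | LLM_SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    return scope


def plan_route(checks: Dict[str, CheckConfig]) -> Dict[str, List[str]]:
    """Group enabled checks by whether they need an LLM call."""
    plan: Dict[str, List[str]] = {"deterministic": [], "llm": []}
    for name, cfg in checks.items():
        if not cfg.enabled:
            continue
        is_free = effective_scope(cfg) in DETERMINISTIC_SCOPES
        plan["deterministic" if is_free else "llm"].append(name)
    return plan


checks = {
    "axa_font_inventory": CheckConfig(scope="document"),
    "axa_print_code": CheckConfig(scope="targeted", scope_args={"pages": "last"}),
    "accessibility": CheckConfig(),  # no scope: treated as page_each
}
print(plan_route(checks))
```

Grouping up front makes the cost shape visible before anything runs: an empty `llm` bucket means the whole profile is a $0 run.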

## Phased delivery

Each phase ships as its own tag so we can demo / rollback in slices.

### Phase 0 — Discovery (you, before further code)

Need from AXA before building Phase 2+:

- [ ] Approved **Monotype font list** (family + weights). Brand refresh moved them off old BOX-licensed fonts; we need the canonical list for `axa_font_compliance`.
- [ ] **Bold-words dictionary** for General Definitions (Example 2 says 70+).
- [ ] 2–3 more old/new PDF pairs beyond Examples 1 & 2 (ideally Motor Insurance for diversity).
- [ ] WCAG target — AA or AAA. EAA scope confirmation.
- [ ] Print preflight scope — "is it print-ready?" or full PDF/X-1a/4 compliance.
- [ ] Volume expectation — 80 pages × how many docs/month.

### Phase 1 — Document-mode plumbing + deterministic checks — REFACTORED 2026-05-01 ✅

**Original Phase 1 (2026-04-29):** spine only, ran existing image-based `accessibility` per page on all 86 pages. Smoke test ran in ~70min for ~$0.50. Output report revealed every "failing page" failed for the same false-positive reason: *"document is presented as an image of text / WCAG 1.4.5"* — the LLM was critiquing our PNG rendering pipeline, not AXA's actual PDF. Result: noisy 67.7/Pass with no actionable findings.

**Refactor (2026-05-01) on `feature/axa-document-mode`:** scope-aware dispatcher + 6 deterministic doc-scope checks. Same Home Insurance PDF now scores 81.4/Pass in seconds at $0 cost, with real findings: font inventory, phone-number inventory, 132 non-bold defined-term occurrences flagged across 53 pages, 5 page-numbering discontinuities, print code "AG400 11/25" detected.

Files added:
- `backend/document_mode/__init__.py`
- `backend/document_mode/ingest.py` — multi-page PDF → per-page PNGs + per-span structured text (font, size, bold flag, italic flag, bbox). Uses PyMuPDF. Bold detection = `flags & 16` OR font name contains `bold|black|heavy`. Default zoom 2.0×, max dim 1600 px, page cap 200.
- `backend/document_mode/dispatcher.py` — **scope-aware** routing. `document` and `targeted` checks bypass LLM; `page_sample` / `page_each` use existing batch dispatcher; `page_pair` reserved for Phase 3.
- `backend/document_mode/checks.py` — registry of 6 deterministic doc-scope checks (see table below). Each returns `{check_name, scope, score, pass, summary, findings, response}`.
- `backend/document_mode/data/axa_bold_words_seed.json` — bootstrap dictionary, 35 terms extracted from Example 1 General Definitions (pages 8-10).
- `backend/document_mode/result_writer.py` — writes JSON + self-contained HTML with: at-a-glance findings table, per-check sections with structured renderers (font/phone/bold-words/page-numbering/print-code/OMG each get their own table), per-page summary strip. Reports collapsed by default.
- `backend/profiles/axa_policy_document.json` — production profile with `mode: "document"`, 6 deterministic checks, `visibility: client_specific, visible_to_clients: ["axa"]`.
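
The bold-detection rule noted for `ingest.py` (`flags & 16` OR a font name containing `bold|black|heavy`) can be sketched on a plain span dict. The helper name and the case-insensitive match are assumptions; the `flags` and `font` keys follow the span dictionaries PyMuPDF returns from its text extraction, and the font names in the usage lines are made up.

```python
import re
from typing import Mapping

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's `flags` field marks bold text
BOLD_NAME_RE = re.compile(r"bold|black|heavy", re.IGNORECASE)  # case-insensitivity is an assumption


def is_bold_span(span: Mapping) -> bool:
    """Return True if a span dict (with `flags` and `font` keys) looks bold."""
    flags = int(span.get("flags", 0))
    font_name = str(span.get("font", ""))
    return bool(flags & BOLD_FLAG) or bool(BOLD_NAME_RE.search(font_name))


# Hypothetical spans for illustration only.
print(is_bold_span({"flags": 16, "font": "ExampleSans-Regular"}))  # True (flag set)
print(is_bold_span({"flags": 4, "font": "ExampleSerif-Black"}))    # True (name match)
print(is_bold_span({"flags": 6, "font": "ExampleSans-Italic"}))    # False
```

The font-name fallback matters because some exported PDFs carry weight only in the font name and leave the flag bit unset.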

### The 6 Phase-1 deterministic checks

| Check | Scope | What it does | Becomes (when client supplies data) |
|---|---|---|---|
| `axa_font_inventory` | document | Lists every unique font + per-page distribution | `axa_font_compliance` (flags non-approved) |
| `axa_phone_inventory` | document | Regex-extracts every phone-shaped number, dedup, with page refs | `axa_phone_compliance` (flags non-approved) |
| `axa_bold_words_definitions` | document | Scans for seed-dictionary terms, flags non-bold occurrences | Same — replace seed dict with AXA's canonical list |
| `axa_page_numbering` | document | Detects standalone-line integers near top/bottom, flags discontinuities | Same |
| `axa_print_code` | targeted: last | Finds back-page print/version line components (code + ref + date + version) | Same — refine regex once AXA confirms format |
| `axa_omg_versioning` | targeted: last | Finds OMG code + date format on back page | Same |

Files modified:
- `backend/profile_config.py` — `Profile.mode` field defaults to `"asset"`. `QCCheckConfig` gains `scope` and `scope_args` fields, both optional. Persisted only when non-default.
- `backend/api_server.py` — `POST /api/document/start_analysis` endpoint. The `enabled_checks` filter accepts checks from the document-mode registry (`is_document_scope_check`) in addition to the legacy `qc_apps` registry, so deterministic AXA checks aren't filtered out.
- `backend/client_config.py` — AXA client gains `axa_policy_document` as first profile.
- `web_ui.html` — doc-mode banner under upload area, file-input `accept` swapped to PDF-only, `performAnalysisWithProgress` routes to `/api/document/start_analysis` with `client_id`.

**Smoke-tested 2026-05-01 (post-refactor) against same Home Insurance PDF:**
- Score: 81.4 / 100 (Pass)
- Total runtime: a few seconds (deterministic only, no LLM calls)
- Total cost: $0
- Findings: 10 fonts, 8 phone numbers, 132 non-bold defined-term occurrences across 53 pages, 5 page-number discontinuities, print code "AG400" + "11/25" detected on back page, no OMG present.
- Smoke-test report: `backend/output-dev/axa/PHASE1_REFACTOR_smoke_test_report.html`
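
For context on the page-number discontinuity count, a minimal sketch of one way such breaks could be detected once a printed number has been extracted per page. The input shape (one optional integer per PDF page) and the function name are assumptions; the heuristic in `checks.py` may differ.

```python
from typing import List, Optional, Tuple


def find_discontinuities(printed: List[Optional[int]]) -> List[Tuple[int, int, int]]:
    """printed[i] is the number detected on PDF page i+1, or None if none was found.

    Returns (pdf_page, expected, found) for each break in the sequence.
    """
    breaks = []
    prev_page = prev_number = None
    for pdf_page, number in enumerate(printed, start=1):
        if number is None:
            continue  # unnumbered pages (covers, dividers) are skipped, not flagged
        if prev_number is not None:
            expected = prev_number + (pdf_page - prev_page)
            if number != expected:
                breaks.append((pdf_page, expected, number))
        prev_page, prev_number = pdf_page, number
    return breaks


# A stray "12" picked up on PDF page 5 (e.g. a table-of-contents entry).
print(find_discontinuities([None, None, 3, 4, 12, 6, 7]))
# [(5, 5, 12), (6, 13, 6)]
```

Note that a single stray number produces two breaks (into and out of it), which is worth keeping in mind when reading raw discontinuity counts.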

**Local test plan (after `./scripts/run-local.sh`):**
1. Pick AXA client → AXA Policy Document profile
2. Upload `axa_ireland/Example 1/6317047 - AXA - Home Insurance Policy 2025 V8 final new brand.pdf`
3. Verify: doc-mode banner, PDF-only picker, progress completes in seconds, report appears with at-a-glance table + per-check sections + structured findings.

**Known gotchas to surface during demo:**
- Bold-words bootstrap dictionary contains short terms (`you`, `your`, `we`, `us`) which produce false positives in normal pronoun usage. Mitigated by Phase-2 work (canonical list from AXA).
- Page-numbering heuristic catches TOC-page numbers as false-positive "page numbers" (5 such hits in this doc). Surface as data, score gently.
- Print-code regex tuned to "AG400 11/25" pattern observed in Example 1; may need tuning for other docs.
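
As an illustration of how narrow a pattern tuned to "AG400 11/25" is, here is a hypothetical regex of that shape. The exact pattern in `checks.py` is not shown in this plan, so the character-class widths and group names below are assumptions, as is the sample text.

```python
import re
from typing import Dict, Optional

# Assumed shape: 2-4 capital letters + 2-5 digits, whitespace, then MM/YY.
PRINT_CODE_RE = re.compile(
    r"\b(?P<code>[A-Z]{2,4}\d{2,5})\s+(?P<date>(?:0[1-9]|1[0-2])/\d{2})\b"
)


def find_print_code(back_page_text: str) -> Optional[Dict[str, str]]:
    """Return {'code': ..., 'date': ...} for the first match, else None."""
    match = PRINT_CODE_RE.search(back_page_text)
    return match.groupdict() if match else None


print(find_print_code("Example back-page footer. AG400 11/25"))
# {'code': 'AG400', 'date': '11/25'}
print(find_print_code("Call us on 01 234 5678"))  # None
```

A document that prints the date as `11/2025` or puts the code and date on separate lines would need the pattern loosened, which is the tuning risk noted above.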

### Phase 2 — Deterministic checks (~3–5 days)

The cheap, accurate wins. No LLM cost.

- `backend/document_mode/font_compliance.py` — reads PDF font inventory, flags anything not on AXA's approved list. Per-page failure log. Plugs into `process_single_check` via the same early-branch pattern as `dj_file_naming` (line ~384 in `api_server.py`).
- `backend/document_mode/bold_words.py` — scans pages for AXA's bold-words dictionary, flags any occurrence not rendered bold.
- `backend/document_mode/print_code.py` — extracts back-page print code, optionally compares to brief-supplied value.
- `backend/document_mode/omg_versioning.py` — confirms back-page OMG number + date format compliance (regex pattern, similar to `file_naming_validator`).
- New Settings → "AXA Configuration" tab for uploading approved fonts list + bold-words list per client (same pattern as media plan upload).
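
A hedged sketch of the bold-words scan described in this list: match dictionary terms on word boundaries and report occurrences in spans that are not bold. The span shape (`page`, `text`, `bold`) and the example terms are hypothetical; the real span records from `ingest.py` carry more fields.

```python
import re
from typing import Dict, Iterable, List


def find_non_bold_terms(spans: Iterable[Dict], terms: Iterable[str]) -> List[Dict]:
    """Report every dictionary term that appears in a non-bold span."""
    cleaned = {t.strip().lower() for t in terms if t.strip()}
    if not cleaned:
        return []
    # Longest first so multi-word terms win over shorter terms they contain.
    ordered = sorted(cleaned, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b", re.IGNORECASE
    )
    findings = []
    for span in spans:
        if span["bold"]:
            continue
        for match in pattern.finditer(span["text"]):
            findings.append({"page": span["page"], "term": match.group(0).lower()})
    return findings


# Hypothetical spans and terms for illustration.
spans = [
    {"page": 12, "text": "We will pay for loss of the buildings.", "bold": False},
    {"page": 12, "text": "buildings", "bold": True},
]
print(find_non_bold_terms(spans, ["buildings", "we", "period of insurance"]))
# [{'page': 12, 'term': 'we'}, {'page': 12, 'term': 'buildings'}]
```

Short pronoun terms such as `we` match in ordinary sentences, which is where the seed dictionary's false positives come from.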

### Phase 3 — Old-vs-new diff — DONE on `feature/axa-document-mode` 2026-05-01 ✅

Vision-LLM-based page-pair diff. Validates the original Example-2 promise: catches the bold-formatting fixes, structural changes, definition updates, and content additions/removals that V1 missed and V10 fixed.

**Files added:**
- `backend/document_mode/diff_engine.py` — page alignment via difflib SequenceMatcher (windowed fuzzy match, threshold 0.4) + parallel page-pair vision-LLM diff via Gemini 2.5 Pro (8 concurrent). Returns alignment map + structured diff JSON per pair (added/removed/modified/moved/style_changes/severity).
- `backend/document_mode/diff_report_writer.py` — diff-specific HTML/JSON. Versions card, at-a-glance grid (page count delta, severity counts), full alignment table, per-pair cards with severity pills + categorised diff blocks.
- `backend/profiles/axa_policy_document_diff.json` — `mode: document_diff` profile.
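
The page-alignment step can be sketched with `difflib.SequenceMatcher` as follows. This is one possible reading of "windowed fuzzy match, threshold 0.4": a greedy, forward-only search over the next few new pages. The window size, the greedy strategy, and the function names are assumptions, not the contents of `diff_engine.py`, and the page texts are invented.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def align_pages(old: List[str], new: List[str], window: int = 6,
                threshold: float = 0.4) -> Dict[int, Optional[int]]:
    """Map each old page index to its best new page index, or None if removed."""
    alignment: Dict[int, Optional[int]] = {}
    next_free = 0  # keeps matches monotonic: never align backwards
    for i, old_text in enumerate(old):
        best_j, best_ratio = None, 0.0
        for j in range(next_free, min(len(new), next_free + window + 1)):
            ratio = SequenceMatcher(None, old_text, new[j]).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_ratio < threshold:
            best_j = None  # nothing similar enough: treat as removed
        alignment[i] = best_j
        if best_j is not None:
            next_free = best_j + 1
    return alignment


def added_pages(alignment: Dict[int, Optional[int]], new_count: int) -> List[int]:
    """New page indexes that no old page was aligned to."""
    matched = {j for j in alignment.values() if j is not None}
    return [j for j in range(new_count) if j not in matched]


old = ["cover page axa landlord insurance",
       "section a buildings cover details",
       "section b contents cover details"]
new = ["cover page axa landlord insurance",
       "section a buildings cover details and limits",
       "section f legal protection new text here",
       "section b contents cover details"]
alignment = align_pages(old, new)
print(alignment)                         # {0: 0, 1: 1, 2: 3}
print(added_pages(alignment, len(new)))  # [2]
```

Only the matched pairs would go on to the vision-LLM diff; unmatched old and new pages are reported as removed and added.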

**Files modified:**
- `backend/api_server.py` — new `POST /api/document/start_diff` endpoint accepting `old_file` + `new_file`. Reuses `_require_client_access`, `progress_tracker`, `ensure_client_output_folder`, `usage_tracker`.
- `backend/client_config.py` — AXA profile list gains `axa_policy_document_diff`.
- `web_ui.html` — third `mode: document_diff` UX path. Two-slot drop-zone (old + new). `applyProfileMode()` swaps between asset/document/document_diff. `wireDiffPickers()` wires the dual file pickers. `startAnalysis()` + `performAnalysisWithProgress()` route diff-mode submissions to `/api/document/start_diff`.

**Smoke-test 2026-05-01 — V1 (68 pages, broken) vs V10 (74 pages, corrected):**
- Wall: 214 seconds (3:34)
- Tokens: 214,342 (cost ≈ $0.50–$0.70)
- 63 matched pairs · 11 pages added in V10 · 5 pages removed
- 61 pages with differences flagged · 2 unchanged
- Severity: 25 high, 32 medium, 4 low
- Score 0/100, "Major changes" — correct call
- Smoke-test report: `backend/output-dev/axa/phase3_smoke_*_diff_report.html`

**Caught the Example-2-class defects:**
- Bold-formatting changes (e.g. *"the terms 'us', 'we', and 'adviser' are now bolded"*, *"the term 'your' is now bolded"*) — exactly the missed-bold issue that motivated this build.
- New Section F: Legal Protection added in V10 — structural insertion caught.
- New "Period of Insurance" definition added — defined-term addition caught.
- "Employee" definition expanded by a sub-point — definition modification caught.
- Wording fix: *"supply your own expense" → "supply at your own expense"* — body-text correction caught.
- 11 pages flagged as added, 5 as removed — page-count delta and structural restructure caught.

**Cost dial (for show-and-tell):** ~$0.40–0.70 per diff against a typical 70-80-page policy.

**Local test plan (UI):**
1. `./scripts/run-local.sh`
2. AXA → AXA Policy Document — Old vs New Diff
3. Pick V1.pdf as old, V10.pdf as new (both from `axa_ireland/Example 2/`)
4. Click analyse. Wait ~3-5 minutes. Report lands in saved files.

### Phase 4 — PDF accessibility — DONE on `feature/axa-document-mode` 2026-05-01 ✅

**Pure-Python implementation.** Original plan was veraPDF subprocess (Java dependency, ~150MB install). Built deterministic PyMuPDF-based check instead — no Java needed for the demo, with veraPDF as an optional add-on later.

**Files added:**
- `backend/document_mode/accessibility_checks.py` — 9 PDF/UA-aligned criteria checked deterministically:
  - **C1** Tagged PDF (StructTreeRoot present)
  - **C2** Marked content (/MarkInfo /Marked true)
  - **C3** Document title metadata
  - **C4** Document language (/Lang)
  - **C5** No password protection blocking AT
  - **C6** All fonts embedded
  - **C7** PDF version ≥ 1.5
  - **C8** XMP UA-conformance declaration
  - **C9** Alt text on images (sampling)
  - Plus a `_run_verapdf()` stub for future veraPDF integration
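
As an example of one criterion, C8 can be sketched as a lookup of the `pdfuaid:part` property in the document's XMP packet. The function name is hypothetical; the namespace URI is the standard PDF/UA identification namespace. With PyMuPDF the XMP string would typically come from `Document.get_xml_metadata()`, but the sketch takes a plain string so it runs with the standard library alone.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFUAID_NS = "http://www.aiim.org/pdfua/ns/id/"


def pdfua_part(xmp: str) -> Optional[str]:
    """Return the declared PDF/UA part (e.g. "1") from an XMP packet, or None."""
    if not xmp.strip():
        return None
    try:
        root = ET.fromstring(xmp)
    except ET.ParseError:
        return None  # unreadable XMP is treated as "no declaration"
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # XMP allows the property as an attribute or as a child element.
        value = desc.get(f"{{{PDFUAID_NS}}}part")
        if value is None:
            child = desc.find(f"{{{PDFUAID_NS}}}part")
            value = child.text if child is not None else None
        if value and value.strip():
            return value.strip()
    return None


SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">
   <pdfuaid:part>1</pdfuaid:part>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""
print(pdfua_part(SAMPLE_XMP))  # 1
```

The same approach should carry over to a PDF/X declaration by swapping in the `pdfxid` namespace (`http://www.npes.org/pdfx/ns/id/`) and the `GTS_PDFXVersion` property.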

**Files modified:**
- `backend/document_mode/checks.py` — `axa_pdf_accessibility` registered.
- `backend/document_mode/ingest.py` — `pdf_path` added to ingest_result so doc-scope checks can read raw PDF structure.
- `backend/document_mode/result_writer.py` — `_render_pdf_accessibility` structured renderer (criteria checklist with pass/fail markers).
- `backend/profiles/axa_policy_document.json` — `axa_pdf_accessibility` added at weight 2.0.

**Smoke-test 2026-05-01 against Example 1 Home Insurance V8 (Adobe InDesign output):**
- Overall AXA Policy Document score: 80.6 / Pass (7 checks, $0 cost, runs in seconds)
- Accessibility check: 7.78 / 10 (7 of 9 criteria passed)
- Real gaps caught:
  - **C7 fail:** PDF 1.4 — should be 1.5+ for full accessibility tagging support
  - **C8 fail:** No PDF/UA-1 conformance flag in XMP metadata
- Pass: tagged structure, marked content, title set, /Lang=en, no encryption, all 10 fonts embedded, alt-text entries detected
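
The 7.78 figure is consistent with an equal-weight pass ratio (7 of 9 criteria, scaled to 10). A small sketch of that arithmetic, assuming equal weights; the function name is illustrative and the real check may weight criteria differently.

```python
from typing import Mapping


def criteria_score(results: Mapping[str, bool], scale: float = 10.0) -> float:
    """Equal-weight score: passed criteria / total criteria, on a 0..scale range."""
    if not results:
        return 0.0
    passed = sum(1 for ok in results.values() if ok)
    return round(scale * passed / len(results), 2)


accessibility = {f"C{i}": True for i in range(1, 10)}
accessibility["C7"] = False  # PDF 1.4
accessibility["C8"] = False  # no PDF/UA-1 flag in XMP
print(criteria_score(accessibility))  # 7.78
```

The same ratio gives 5.71 for 4 of 7 criteria, matching the print preflight smoke-test figure.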

**veraPDF integration plan (when ready):**
1. Install veraPDF on host: https://verapdf.org/software/ (requires JRE 8+, ~150MB)
2. Ensure `verapdf` binary on PATH or set `VERAPDF_BIN` env var
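3. Replace `_run_verapdf()` stub with `subprocess.run([verapdf, '--format', 'json', '--flavour', 'ua1', pdf_path], capture_output=True)` and merge JSON findings into `axa_pdf_accessibility`'s output. In the veraPDF CLI, `--flavour` selects a built-in profile such as PDF/UA-1, while `--profile` expects a path to a profile file; confirm both flags against `verapdf --help` for the installed version.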
4. Set `findings['verapdf_run'] = True`
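
A hedged sketch of what the stub replacement in the steps above might look like. It degrades to `None` when veraPDF is not installed so the PyMuPDF criteria still run. The helper names and the timeout are assumptions. The sketch passes `--flavour ua1` on the understanding that this is the CLI option for a built-in PDF/UA-1 profile (with `--profile` taking a profile file path), and it assumes the installed release supports `--format json`; check `verapdf --help` before relying on either.

```python
import json
import os
import shutil
import subprocess
from typing import List, Optional


def find_verapdf() -> Optional[str]:
    """Resolve the veraPDF executable from VERAPDF_BIN or PATH."""
    candidate = os.environ.get("VERAPDF_BIN") or "verapdf"
    return shutil.which(candidate)


def build_command(binary: str, pdf_path: str) -> List[str]:
    # Flags are assumptions to confirm against the installed veraPDF version.
    return [binary, "--format", "json", "--flavour", "ua1", pdf_path]


def run_verapdf(pdf_path: str, timeout: int = 120) -> Optional[dict]:
    """Return veraPDF's parsed JSON report, or None if it is unavailable."""
    binary = find_verapdf()
    if binary is None:
        return None
    try:
        proc = subprocess.run(build_command(binary, pdf_path),
                              capture_output=True, text=True, timeout=timeout)
        return json.loads(proc.stdout)
    except (OSError, subprocess.TimeoutExpired, json.JSONDecodeError):
        return None
```

A caller could then set `findings['verapdf_run'] = report is not None` and merge the report only when one came back.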

### Phase 5 — Print preflight — DONE on `feature/axa-document-mode` 2026-05-01 ✅

**Pure-Python implementation.** Original plan was Ghostscript-based; built deterministic PyMuPDF checks instead — same approach as Phase 4. Ghostscript can plug in later for total-ink-coverage / registration-black if scope grows.

**Files added:**
- `backend/document_mode/print_preflight_checks.py` — 7 deterministic preflight criteria:
  - **PP1** Page geometry consistency (single MediaBox size across all pages)
  - **PP2** Bleed area defined (TrimBox/BleedBox differ from MediaBox)
  - **PP3** Image colour spaces (flag DeviceRGB; press wants CMYK/Gray)
  - **PP4** Image effective DPI (raw pixels / rendered inches; flag < 150)
  - **PP5** Transparency / soft-mask usage (flag for flattening)
  - **PP6** PDF/X conformance (XMP `pdfxid:GTS_PDFXVersion`)
  - **PP7** Spot colour usage (flag /Separation, /DeviceN)
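
PP4's "raw pixels / rendered inches" works out as below. PDF user space is 72 points per inch, so the placed size in points converts directly to inches. Taking the lower of the two axes is an assumption made here to be conservative for non-uniformly scaled images; the function names are illustrative.

```python
POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def effective_dpi(pixel_w: int, pixel_h: int,
                  placed_w_pt: float, placed_h_pt: float) -> float:
    """Lower of the horizontal and vertical pixel densities at the placed size."""
    if placed_w_pt <= 0 or placed_h_pt <= 0:
        raise ValueError("placed size must be positive")
    dpi_x = pixel_w / (placed_w_pt / POINTS_PER_INCH)
    dpi_y = pixel_h / (placed_h_pt / POINTS_PER_INCH)
    return min(dpi_x, dpi_y)


def is_low_res(dpi: float, threshold: float = 150.0) -> bool:
    """Flag images below the preflight threshold (150 DPI per PP4)."""
    return dpi < threshold


# A 1200x800 px image placed at 288x192 pt (4 x 2.67 in): about 300 DPI.
print(round(effective_dpi(1200, 800, 288, 192)))
# The same image stretched to twice that size drops to about 150 DPI.
print(round(effective_dpi(1200, 800, 576, 384)))
```

The point of measuring at the placed size is that the same image file can pass on one page and fail on another where it is scaled up.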

**Files modified:**
- `backend/document_mode/checks.py` — `axa_print_preflight` registered.
- `backend/document_mode/result_writer.py` — `_render_print_preflight` structured renderer with low-DPI image list, colour-space breakdown, spot-colour list, page-size detail.
- `backend/profiles/axa_policy_document.json` — `axa_print_preflight` added at weight 1.0.

**Smoke-test 2026-05-01 against Example 1 Home Insurance V8:**
- Print preflight: 5.71 / 10 (4 of 7 criteria pass) — correctly flags as digital-intent
  - ✓ PP1 — All 86 pages 210×297 mm (A4), consistent
  - ✗ PP2 — No bleed authored (digital intent — correct finding for an electronic policy)
  - ✓ PP3 — Only 1 grayscale image, no RGB
  - ✓ PP4 — Image renders at 279 DPI (above 150 threshold)
  - ✗ PP5 — 85 of 86 pages use transparency / soft-masks (Adobe InDesign default; would need flattening for press)
  - ✗ PP6 — No PDF/X conformance flag in XMP
  - ✓ PP7 — No spot colour spaces
- Updated full-profile score: 78.25 / Pass (8 checks now)

**Demo conversation:** *"If you're distributing electronically, this PDF is fine. If you're going to press, you need to (1) author bleed in InDesign, (2) flatten transparency on export, (3) declare PDF/X-1a or PDF/X-4 conformance."*

## All phases status (2026-05-01)

| Phase | Scope | Status | Cost | Wall |
|---|---|---|---|---|
| 1 | Spine + 6 deterministic doc-scope checks | ✅ Done | $0 | seconds |
| 2 | Compliance variants (font/phone/bold lists) | Blocked on AXA | — | — |
| 3 | Old-vs-new diff (vision LLM page-pair) | ✅ Done | ~$0.50/run | ~3-5 min |
| 4 | PDF accessibility (PyMuPDF, veraPDF stub) | ✅ Done | $0 | seconds |
| 5 | Print preflight (PyMuPDF, Ghostscript later) | ✅ Done | $0 | seconds |

**Demo-ready as of 2026-05-01.** All work is on `feature/axa-document-mode`, local-only, no commits or pushes yet.

## Things to flag before any further build

1. **80 pages × multiple LLM checks = serious cost.** Running the existing static checks on every page of one document would cost roughly $5–10 in Gemini/OpenAI calls. We should decide which LLM checks must run on every page and which can run once over a sampled set. Most should be deterministic-only.
2. **veraPDF is Java.** Adds a JRE dependency to GCP boxes.
3. **PDF mode breaks "one upload = one report" assumption.** Decide what to save: full per-page JSON, summary only, or both. (Phase 1 saves both.)
4. **Reporting/billing.** An 80-page doc is one analysis but 80× the LLM work. We should bill it as one analysis but track total checks separately. `usage_tracker.log_analysis_complete` already gets `pages_processed` in doc mode.

## How doc-mode plugs into existing pipeline

For maintenance — the integration map:

| Doc-mode component | Reuses existing | Where |
|---|---|---|
| Per-page check execution | `process_checks_in_batches()` | `api_server.py:498` |
| Per-check dispatch | `process_single_check()` | `api_server.py:377` |
| LLM call | `run_visual_qc()` | `llm_config.py` |
| Auth + client access | `auth.require_auth`, `_require_client_access()` | `api_server.py:4883` |
| Progress polling | `/api/progress/<id>` | `api_server.py:1695` |
| Output serving | `/output/<client>/<filename>` | `api_server.py:2121` |
| Output listing | `/api/output_files` | `api_server.py:2168` |
| Output folder | `ensure_client_output_folder()` | `api_server.py:856` |
| Profile loading | `profile_config.get_profile()` | `profile_config.py:219` |
| Profile visibility | `client_config.get_profiles_with_visibility()` | `client_config.py:82` |
| Usage logging | `usage_tracker.log_analysis_start/complete()` | `usage_tracker.py:73,100` |

Future deterministic doc-mode checks should follow `_run_dj_file_naming_check()` (`api_server.py:348`) — short-circuit at the top of `process_single_check` before any LLM dispatch.