ai_qc/backend/AXA_DOCUMENT_MODE_PLAN.md
nickviljoen 90563b8cf2 Add AXA document-mode QC pipeline (Phases 1, 3, 4, 5)
Multi-page PDF QC for AXA Ireland policy documents. Runs as a third mode
alongside static + video, gated on profile.mode. New code isolated under
backend/document_mode/ with new endpoints under /api/document/*.

Phase 1 — Spine + 6 deterministic doc-scope checks ($0, runs in seconds):
- Scope-aware dispatcher (document/targeted/page_sample/page_pair/page_each)
- axa_font_inventory, axa_phone_inventory, axa_bold_words_definitions,
  axa_page_numbering, axa_print_code, axa_omg_versioning
- Bootstrap bold-words dictionary extracted from Example 1 General Definitions

Phase 3 — Old-vs-new diff (~$0.50/run, 3-5 min):
- Page alignment via difflib SequenceMatcher (windowed fuzzy match)
- Vision-LLM page-pair diff via Gemini 2.5 Pro (8 concurrent)
- Two-slot upload UX, axa_policy_document_diff profile, mode=document_diff

Phase 4 — PDF accessibility (PyMuPDF, $0):
- 9 PDF/UA-1 aligned criteria (tagged structure, /MarkInfo, title, /Lang,
  encryption, font embedding, PDF version, XMP UA-conformance, alt-text)
- _run_verapdf() stub for optional Java-based veraPDF integration later

Phase 5 — Print preflight (PyMuPDF, $0):
- 7 criteria (page geometry, bleed, image colour spaces, image DPI,
  transparency, PDF/X conformance, spot colours)

Profile additions:
- axa_policy_document — 8 deterministic checks, $0 cost
- axa_policy_document_diff — 1 page-pair LLM check, ~$0.50/run

API additions:
- POST /api/document/start_analysis (single PDF)
- POST /api/document/start_diff (old + new PDFs)

Frontend additions:
- Third profile.mode value (document_diff) in applyProfileMode()
- Two-slot upload UX with PDF-only file pickers
- checkFormValidity() branches by mode for the analyse-button gate

Smoke-tested locally against Example 1 (Home Insurance V8, 86pp) and
Example 2 (Landlord V1 vs V10, 68→74pp) with real findings caught
including bold-words gaps, missing PDF/UA flag, transparency on press,
V1→V10 bold-formatting fixes. Plan + integration map + gotchas in
backend/AXA_DOCUMENT_MODE_PLAN.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:38:14 +02:00


# AXA Document-Mode QC — Build Plan
Multi-page PDF QC pipeline for AXA Ireland. Different from every other client onboard because the QC *target* is an 80-page policy PDF, not a single image/video. Sources of requirements:
- `axa_ireland/AXA build guide new tools required.txt` — original ask + first scoping
- `axa_ireland/Exampole Folders explained.rtf` — what the example folders represent
- `axa_ireland/Example 1/` — old vs new (post brand-refresh) Home Insurance policy + the human QC checklist (`Policy documents QC CHECKS.docx`)
- `axa_ireland/Example 2/` — Landlord Insurance v1 (shipped with errors), V10 (corrected), V1 2025 amends, original master. **Phase-3 diff test pair: V1 vs V10.**
## Sequencing (updated 2026-05-01)
Inverted from "discover everything → build" to "build minimum demo → show client → drive deeper discovery from their reactions".
1. **Phase 1 refactor** ✅ DONE on `feature/axa-document-mode` 2026-05-01 — scope-aware dispatcher, 6 deterministic checks, $0 LLM cost, runs in seconds. Replaces broken Phase-1 stub.
2. **Phase 3 build** (next) — old-vs-new diff using vision-LLM page-pair, no LlamaParse account needed.
3. **Show-and-tell with AXA** — demonstrate Phases 1 + 3 working, show costs, gather requirements (font list, bold-words list, phone numbers, WCAG target, print preflight scope).
4. **Phases 2, 4, 5** in order, with local testing between each.
5. Local-only until show-and-tell; do not push `feature/axa-document-mode` to dev or merge to develop yet.
## Phase 0 discovery — answered 2026-05-01 (provisional, pending client confirmation)
| Question | Decision |
|---|---|
| Approved Monotype font list | **Not yet supplied.** Until then, `axa_font_inventory` lists fonts only (informational). When list arrives, becomes `axa_font_compliance`. |
| Bold-words dictionary | **Bootstrap from Example 1 General Definitions** (pages 8-10) until AXA supplies canonical list. 35 terms extracted, saved to `backend/document_mode/data/axa_bold_words_seed.json`. Some short terms (`you`, `your`, `we`) produce false positives; accept until canonical list lands. |
| Approved phone number list | **Not yet supplied.** `axa_phone_inventory` lists numbers only. Becomes `axa_phone_compliance` when list arrives. |
| WCAG target | **AAA** (Phase 4 veraPDF profile setting) |
| Print preflight scope | **"Is it print-ready?"** simple version. Expand later if needed. |
| Page sampling defaults | **N=8** for visual sanity, **N=5** for print sanity |
| Volume expectation | Still pending — drives later cost decisions |
## Architectural choice
**One web UI, isolated backend module.** Doc-mode shares the existing shell (auth, client picker, Settings, Reporting, Admin, user-access, output history) and runs as a third mode alongside static + video. New code lives in `backend/document_mode/` with new endpoints under `/api/document/*`, gated by `mode: "document"` on the profile JSON. Existing single-asset clients are not touched.
Rejected: a separate `/ai_qc/documents/` page. Would re-implement the shell for one client and fork future doc-mode work per-client.
## Scope-aware dispatcher (Phase 1 refactor, 2026-05-01)
Each check declares its `scope` in the profile JSON. The dispatcher routes by scope:
| Scope | Behaviour | Cost shape | Phase 1 uses? |
|---|---|---|---|
| `document` | Run once over the full ingest result. Deterministic checks live in `document_mode/checks.py`. | $0, milliseconds | ✅ all 4 deterministic doc-checks |
| `targeted` | Run once on specific pages (`scope_args.pages`: `first`, `last`, `first-N`, `last-N`, or list). | $0, milliseconds | ✅ print_code, omg_versioning |
| `page_sample` | Run on N evenly-spaced pages via existing batch dispatcher. | N × LLM call | Phase 4/5 |
| `page_pair` | Run on aligned old/new page pairs. | M × LLM call | Phase 3 |
| `page_each` | Run on every page (legacy / very expensive). | N × LLM call per check | Avoid |
Profile JSON shape:
```json
"checks": {
"axa_font_inventory": {"weight": 1.0, "scope": "document", "enabled": true},
"axa_print_code": {"weight": 1.0, "scope": "targeted", "scope_args": {"pages": "last"}, "enabled": true}
}
```
The scope field is optional in the QCCheckConfig dataclass — defaults to None, which the doc-mode dispatcher treats as `page_each` for legacy compatibility. Asset-mode pipeline ignores scope entirely.
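A minimal sketch of the routing described above, covering only the two deterministic scopes. `resolve_target_pages`, `dispatch_check` and the registry shape are illustrative names, not the real `dispatcher.py` API; treating list entries in `scope_args.pages` as 1-based page numbers and defaulting to `last` are assumptions.

```python
from typing import Any, Callable, Dict, List, Union

PagesArg = Union[str, List[int]]
CheckFn = Callable[[List[dict]], Dict[str, Any]]


def resolve_target_pages(pages: PagesArg, page_count: int) -> List[int]:
    """Turn a scope_args.pages value into 0-based page indices."""
    if isinstance(pages, list):
        # Assumption: explicit lists hold 1-based page numbers.
        return [p - 1 for p in pages if 1 <= p <= page_count]
    if pages == "first":
        return [0] if page_count else []
    if pages == "last":
        return [page_count - 1] if page_count else []
    kind, _, count = pages.partition("-")
    if kind not in ("first", "last") or not count.isdigit():
        raise ValueError(f"unsupported pages value: {pages!r}")
    n = min(int(count), page_count)
    if kind == "first":
        return list(range(n))
    return list(range(page_count - n, page_count))


def dispatch_check(name: str, config: dict, pages: List[dict],
                   registry: Dict[str, CheckFn]) -> Dict[str, Any]:
    """Route one check by its declared scope (deterministic scopes only)."""
    scope = config.get("scope") or "page_each"  # None falls back to page_each
    if scope == "document":
        return registry[name](pages)
    if scope == "targeted":
        wanted = config.get("scope_args", {}).get("pages", "last")
        subset = [pages[i] for i in resolve_target_pages(wanted, len(pages))]
        return registry[name](subset)
    # page_sample / page_pair / page_each go to the LLM batch dispatcher,
    # which is outside the scope of this sketch.
    raise NotImplementedError(f"scope {scope!r} needs the LLM dispatcher")
```

With the profile entry shown above, `axa_print_code` would receive only the final page of the ingest result.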
## Phased delivery
Each phase ships as its own tag so we can demo / rollback in slices.
### Phase 0 — Discovery (you, before further code)
Need from AXA before building Phase 2+:
- [ ] Approved **Monotype font list** (family + weights). Brand refresh moved them off old BOX-licensed fonts; we need the canonical list for `axa_font_compliance`.
- [ ] **Bold-words dictionary** for General Definitions (Example 2 says 70+).
- [ ] 2-3 more old/new PDF pairs beyond Examples 1 & 2 (ideally Motor Insurance for diversity).
- [ ] WCAG target — AA or AAA. EAA scope confirmation.
- [ ] Print preflight scope — "is it print-ready?" or full PDF/X-1a/4 compliance.
- [ ] Volume expectation — 80 pages × how many docs/month.
### Phase 1 — Document-mode plumbing + deterministic checks — REFACTORED 2026-05-01 ✅
**Original Phase 1 (2026-04-29):** spine only, ran existing image-based `accessibility` per page on all 86 pages. Smoke test ran in ~70min for ~$0.50. Output report revealed every "failing page" failed for the same false-positive reason: *"document is presented as an image of text / WCAG 1.4.5"* — the LLM was critiquing our PNG rendering pipeline, not AXA's actual PDF. Result: noisy 67.7/Pass with no actionable findings.
**Refactor (2026-05-01) on `feature/axa-document-mode`:** scope-aware dispatcher + 6 deterministic doc-scope checks. Same Home Insurance PDF now scores 81.4/Pass in seconds at $0 cost, with real findings: font inventory, phone-number inventory, 132 non-bold defined-term occurrences flagged across 53 pages, 5 page-numbering discontinuities, print code "AG400 11/25" detected.
Files added:
- `backend/document_mode/__init__.py`
- `backend/document_mode/ingest.py` — multi-page PDF → per-page PNGs + per-span structured text (font, size, bold flag, italic flag, bbox). Uses PyMuPDF. Bold detection = `flags & 16` OR font name contains `bold|black|heavy`. Default zoom 2.0×, max dim 1600 px, page cap 200.
- `backend/document_mode/dispatcher.py`: **scope-aware** routing. `document` and `targeted` checks bypass LLM; `page_sample` / `page_each` use existing batch dispatcher; `page_pair` reserved for Phase 3.
- `backend/document_mode/checks.py` — registry of 6 deterministic doc-scope checks (see table below). Each returns `{check_name, scope, score, pass, summary, findings, response}`.
- `backend/document_mode/data/axa_bold_words_seed.json` — bootstrap dictionary, 35 terms extracted from Example 1 General Definitions (pages 8-10).
- `backend/document_mode/result_writer.py` — writes JSON + self-contained HTML with: at-a-glance findings table, per-check sections with structured renderers (font/phone/bold-words/page-numbering/print-code/OMG each get their own table), per-page summary strip. Reports collapsed by default.
- `backend/profiles/axa_policy_document.json` — production profile with `mode: "document"`, 6 deterministic checks, `visibility: client_specific, visible_to_clients: ["axa"]`.
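The bold-detection rule used by `ingest.py` (`flags & 16` OR a font name containing `bold|black|heavy`) can be sketched as below. The span dict follows the shape PyMuPDF returns from `page.get_text("dict")` (`flags`, `font`); the helper name `is_bold` is illustrative, and case-insensitive matching is an assumption.

```python
import re

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" field marks bold
BOLD_NAME = re.compile(r"bold|black|heavy", re.IGNORECASE)


def is_bold(span: dict) -> bool:
    """True when either the span flag or the font name signals a bold face."""
    if span.get("flags", 0) & BOLD_FLAG:
        return True
    # Some fonts carry the weight only in the name, e.g. "SourceSans-Black".
    return BOLD_NAME.search(span.get("font", "")) is not None
```

The font-name fallback matters because a PDF producer can embed a heavy weight as its own font without setting the bold flag.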
### The 6 Phase-1 deterministic checks
| Check | Scope | What it does | Becomes (when client supplies data) |
|---|---|---|---|
| `axa_font_inventory` | document | Lists every unique font + per-page distribution | `axa_font_compliance` (flags non-approved) |
| `axa_phone_inventory` | document | Regex-extracts every phone-shaped number, dedup, with page refs | `axa_phone_compliance` (flags non-approved) |
| `axa_bold_words_definitions` | document | Scans for seed-dictionary terms, flags non-bold occurrences | Same — replace seed dict with AXA's canonical list |
| `axa_page_numbering` | document | Detects standalone-line integers near top/bottom, flags discontinuities | Same |
| `axa_print_code` | targeted: last | Finds back-page print/version line components (code + ref + date + version) | Same — refine regex once AXA confirms format |
| `axa_omg_versioning` | targeted: last | Finds OMG code + date format on back page | Same |
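As an illustration of the `axa_bold_words_definitions` row, here is a rough sketch of scanning spans for seed-dictionary terms and flagging non-bold occurrences. The span shape (`page`, `text`, `bold`) and the function name are assumptions made for this example; the real check in `checks.py` may work differently.

```python
import re
from typing import Dict, Iterable, List


def find_non_bold_terms(spans: Iterable[dict], terms: Iterable[str]) -> List[Dict[str, object]]:
    """Return one finding per defined term that appears in a non-bold span."""
    # Longest terms first so a multi-word term wins over a word it contains.
    ordered = sorted({t.lower() for t in terms if t}, key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    findings = []
    for span in spans:
        if span["bold"]:
            continue  # already rendered bold, nothing to flag
        for match in pattern.finditer(span["text"]):
            findings.append({
                "page": span["page"],
                "term": match.group(1).lower(),
                "context": span["text"],
            })
    return findings
```

Whole-word matching on short terms such as `you` or `we` flags ordinary pronoun usage too, which is the false-positive behaviour the seed dictionary is known to produce.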
Files modified:
- `backend/profile_config.py`: `Profile.mode` field defaults to `"asset"`. `QCCheckConfig` gains `scope` and `scope_args` fields, both optional. Persisted only when non-default.
- `backend/api_server.py`: `POST /api/document/start_analysis` endpoint. The `enabled_checks` filter accepts checks from the document-mode registry (`is_document_scope_check`) in addition to the legacy `qc_apps` registry, so deterministic AXA checks aren't filtered out.
- `backend/client_config.py` — AXA client gains `axa_policy_document` as first profile.
- `web_ui.html` — doc-mode banner under upload area, file-input `accept` swapped to PDF-only, `performAnalysisWithProgress` routes to `/api/document/start_analysis` with `client_id`.
**Smoke-tested 2026-05-01 (post-refactor) against same Home Insurance PDF:**
- Score: 81.4 / 100 (Pass)
- Total runtime: a few seconds (deterministic only, no LLM calls)
- Total cost: $0
- Findings: 10 fonts, 8 phone numbers, 132 non-bold defined-term occurrences across 53 pages, 5 page-number discontinuities, print code "AG400" + "11/25" detected on back page, no OMG present.
- Smoke-test report: `backend/output-dev/axa/PHASE1_REFACTOR_smoke_test_report.html`
**Local test plan (after `./scripts/run-local.sh`):**
1. Pick AXA client → AXA Policy Document profile
2. Upload `axa_ireland/Example 1/6317047 - AXA - Home Insurance Policy 2025 V8 final new brand.pdf`
3. Verify: doc-mode banner, PDF-only picker, progress completes in seconds, report appears with at-a-glance table + per-check sections + structured findings.
**Known gotchas to surface during demo:**
- Bold-words bootstrap dictionary contains short terms (`you`, `your`, `we`, `us`) which produce false positives in normal pronoun usage. Mitigated by Phase-2 work (canonical list from AXA).
- Page-numbering heuristic catches TOC-page numbers as false-positive "page numbers" (5 such hits in this doc). Surface as data, score gently.
- Print-code regex tuned to "AG400 11/25" pattern observed in Example 1; may need tuning for other docs.
### Phase 2 — Deterministic checks (~3-5 days)
The cheap, accurate wins. No LLM cost.
- `backend/document_mode/font_compliance.py` — reads PDF font inventory, flags anything not on AXA's approved list. Per-page failure log. Plugs into `process_single_check` via the same early-branch pattern as `dj_file_naming` (line ~384 in `api_server.py`).
- `backend/document_mode/bold_words.py` — scans pages for AXA's bold-words dictionary, flags any occurrence not rendered bold.
- `backend/document_mode/print_code.py` — extracts back-page print code, optionally compares to brief-supplied value.
- `backend/document_mode/omg_versioning.py` — confirms back-page OMG number + date format compliance (regex pattern, similar to `file_naming_validator`).
- New Settings → "AXA Configuration" tab for uploading approved fonts list + bold-words list per client (same pattern as media plan upload).
### Phase 3 — Old-vs-new diff — DONE on `feature/axa-document-mode` 2026-05-01 ✅
Vision-LLM-based page-pair diff. Validates the original Example-2 promise: catches the bold-formatting fixes, structural changes, definition updates, and content additions/removals that V1 missed and V10 fixed.
**Files added:**
- `backend/document_mode/diff_engine.py` — page alignment via difflib SequenceMatcher (windowed fuzzy match, threshold 0.4) + parallel page-pair vision-LLM diff via Gemini 2.5 Pro (8 concurrent). Returns alignment map + structured diff JSON per pair (added/removed/modified/moved/style_changes/severity).
- `backend/document_mode/diff_report_writer.py` — diff-specific HTML/JSON. Versions card, at-a-glance grid (page count delta, severity counts), full alignment table, per-pair cards with severity pills + categorised diff blocks.
- `backend/profiles/axa_policy_document_diff.json`: `mode: document_diff` profile.
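The page-alignment step can be sketched with the standard-library `difflib.SequenceMatcher`. This is a simplified greedy version, not the actual `diff_engine.py` logic: the 0.4 threshold comes from the plan, while the window size of 5 and the forward-only matching are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]  # (old_index, new_index)


def align_pages(old_pages: List[str], new_pages: List[str],
                threshold: float = 0.4, window: int = 5) -> List[Pair]:
    """Greedily pair each old page with its best match in a forward window."""
    pairs: List[Pair] = []
    next_new = 0  # new pages before this index are already consumed
    for old_idx, old_text in enumerate(old_pages):
        best_ratio, best_new = 0.0, None
        for new_idx in range(next_new, min(next_new + window, len(new_pages))):
            ratio = SequenceMatcher(None, old_text, new_pages[new_idx]).ratio()
            if ratio > best_ratio:
                best_ratio, best_new = ratio, new_idx
        if best_new is not None and best_ratio >= threshold:
            # New pages skipped over on the way to the match count as added.
            pairs.extend((None, i) for i in range(next_new, best_new))
            pairs.append((old_idx, best_new))
            next_new = best_new + 1
        else:
            pairs.append((old_idx, None))  # no match in the window: removed
    # Whatever is left at the end of the new document was added.
    pairs.extend((None, i) for i in range(next_new, len(new_pages)))
    return pairs
```

Only pairs where both indices are set would go on to the vision-LLM diff; entries with `None` on one side feed the added/removed page counts.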
**Files modified:**
- `backend/api_server.py` — new `POST /api/document/start_diff` endpoint accepting `old_file` + `new_file`. Reuses `_require_client_access`, `progress_tracker`, `ensure_client_output_folder`, `usage_tracker`.
- `backend/client_config.py` — AXA profile list gains `axa_policy_document_diff`.
- `web_ui.html` — third `mode: document_diff` UX path. Two-slot drop-zone (old + new). `applyProfileMode()` swaps between asset/document/document_diff. `wireDiffPickers()` wires the dual file pickers. `startAnalysis()` + `performAnalysisWithProgress()` route diff-mode submissions to `/api/document/start_diff`.
**Smoke-test 2026-05-01 — V1 (68 pages, broken) vs V10 (74 pages, corrected):**
- Wall: 214 seconds (3:34)
- Tokens: 214,342 (cost ≈ $0.50-$0.70)
- 63 matched pairs · 11 pages added in V10 · 5 pages removed
- 61 pages with differences flagged · 2 unchanged
- Severity: 25 high, 32 medium, 4 low
- Score 0/100, "Major changes" — correct call
- Smoke-test report: `backend/output-dev/axa/phase3_smoke_*_diff_report.html`
**Caught the Example-2-class defects:**
- Bold-formatting changes (e.g. *"the terms 'us', 'we', and 'adviser' are now bolded"*, *"the term 'your' is now bolded"*) — exactly the missed-bold issue that motivated this build.
- New Section F: Legal Protection added in V10 — structural insertion caught.
- New "Period of Insurance" definition added — defined-term addition caught.
- "Employee" definition expanded by a sub-point — definition modification caught.
- Wording fix: *"supply your own expense" → "supply at your own expense"* — body-text correction caught.
- 11 pages flagged as added, 5 as removed — page-count delta and structural restructure caught.
**Cost dial (for show-and-tell):** ~$0.40-0.70 per diff against a typical 70-80-page policy.
**Local test plan (UI):**
1. `./scripts/run-local.sh`
2. AXA → AXA Policy Document — Old vs New Diff
3. Pick V1.pdf as old, V10.pdf as new (both from `axa_ireland/Example 2/`)
4. Click analyse. Wait ~3-5 minutes. Report lands in saved files.
### Phase 4 — PDF accessibility — DONE on `feature/axa-document-mode` 2026-05-01 ✅
**Pure-Python implementation.** Original plan was veraPDF subprocess (Java dependency, ~150MB install). Built deterministic PyMuPDF-based check instead — no Java needed for the demo, with veraPDF as an optional add-on later.
**Files added:**
- `backend/document_mode/accessibility_checks.py` — 9 PDF/UA-aligned criteria checked deterministically:
  - **C1** Tagged PDF (StructTreeRoot present)
  - **C2** Marked content (/MarkInfo /Marked true)
  - **C3** Document title metadata
  - **C4** Document language (/Lang)
  - **C5** No password protection blocking AT
  - **C6** All fonts embedded
  - **C7** PDF version ≥ 1.5
  - **C8** XMP UA-conformance declaration
  - **C9** Alt text on images (sampling)
  - Plus a `_run_verapdf()` stub for future veraPDF integration
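A sketch of how criteria C1 to C5 could be read from the PDF catalog with PyMuPDF. It uses `Document.pdf_catalog()`, `Document.xref_get_key()`, `Document.xref_object()`, `Document.metadata` and `Document.needs_pass`; the function names and result keys are illustrative, and using `needs_pass` for C5 is an approximation that may differ from what `accessibility_checks.py` does.

```python
import re
from typing import Dict, Optional


def catalog_value(doc, key: str) -> Optional[str]:
    """Return the raw value of a catalog key, or None when it is absent."""
    kind, value = doc.xref_get_key(doc.pdf_catalog(), key)
    return None if kind == "null" else value


def is_marked(doc) -> bool:
    """C2: the catalog's /MarkInfo dictionary contains /Marked true."""
    kind, value = doc.xref_get_key(doc.pdf_catalog(), "MarkInfo")
    if kind == "xref":  # indirect reference such as "12 0 R"
        value = doc.xref_object(int(value.split()[0]))
    elif kind != "dict":
        return False
    return re.search(r"/Marked\s+true", value) is not None


def basic_accessibility_flags(doc) -> Dict[str, bool]:
    """Evaluate the catalog-level criteria on an open PyMuPDF document."""
    return {
        "C1_tagged": catalog_value(doc, "StructTreeRoot") is not None,
        "C2_marked": is_marked(doc),
        "C3_title": bool((doc.metadata or {}).get("title")),
        "C4_lang": catalog_value(doc, "Lang") is not None,
        "C5_no_password": not doc.needs_pass,
    }


def check_pdf(path: str) -> Dict[str, bool]:
    import fitz  # PyMuPDF, imported lazily so the helpers above stay dependency-free
    with fitz.open(path) as doc:
        return basic_accessibility_flags(doc)
```

Font embedding, XMP conformance and alt-text sampling (C6, C8, C9) need deeper object walking and are left out of this sketch.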
**Files modified:**
- `backend/document_mode/checks.py`: `axa_pdf_accessibility` registered.
- `backend/document_mode/ingest.py`: `pdf_path` added to ingest_result so doc-scope checks can read raw PDF structure.
- `backend/document_mode/result_writer.py`: `_render_pdf_accessibility` structured renderer (criteria checklist with pass/fail markers).
- `backend/profiles/axa_policy_document.json`: `axa_pdf_accessibility` added at weight 2.0.
**Smoke-test 2026-05-01 against Example 1 Home Insurance V8 (Adobe InDesign output):**
- Overall AXA Policy Document score: 80.6 / Pass (7 checks, $0 cost, runs in seconds)
- Accessibility check: 7.78 / 10 (7 of 9 criteria passed)
- Real gaps caught:
  - **C7 fail:** PDF 1.4, should be 1.5+ for full accessibility tagging support
  - **C8 fail:** No PDF/UA-1 conformance flag in XMP metadata
- Pass: tagged structure, marked content, title set, /Lang=en, no encryption, all 10 fonts embedded, alt-text entries detected
**veraPDF integration plan (when ready):**
1. Install veraPDF on host: https://verapdf.org/software/ (requires JRE 8+, ~150MB)
2. Ensure `verapdf` binary on PATH or set `VERAPDF_BIN` env var
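3. Replace `_run_verapdf()` stub with `subprocess.run([verapdf, '--format', 'json', '--flavour', 'ua1', pdf_path], capture_output=True)` and merge JSON findings into `axa_pdf_accessibility`'s output. (`--flavour` selects the built-in PDF/UA-1 profile; `--profile` expects a path to a profile file.)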
4. Set `findings['verapdf_run'] = True`
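The integration steps above might look roughly like this. It is a sketch only: the flag names (`--format json`, `--flavour ua1`) are assumptions about the installed veraPDF CLI and should be confirmed with `verapdf --help`, and the helper names are hypothetical. The exit code is deliberately not checked, because a validator can exit non-zero simply because the PDF failed validation.

```python
import json
import os
import shutil
import subprocess
from typing import List, Optional


def find_verapdf() -> Optional[str]:
    """Prefer the VERAPDF_BIN env var, then fall back to a PATH lookup."""
    return os.environ.get("VERAPDF_BIN") or shutil.which("verapdf")


def build_verapdf_command(binary: str, pdf_path: str) -> List[str]:
    # Assumed flags: JSON report, built-in PDF/UA-1 validation profile.
    return [binary, "--format", "json", "--flavour", "ua1", pdf_path]


def run_verapdf(pdf_path: str, timeout_s: int = 300) -> Optional[dict]:
    """Return the parsed veraPDF report, or None when veraPDF is unavailable."""
    binary = find_verapdf()
    if binary is None:
        return None  # caller keeps the PyMuPDF-only result
    proc = subprocess.run(
        build_verapdf_command(binary, pdf_path),
        capture_output=True, text=True, timeout=timeout_s,
    )
    if not proc.stdout.strip():
        return None
    return json.loads(proc.stdout)
```

Returning `None` when the binary is missing keeps the check usable on hosts without a JRE, which matches the "optional add-on" intent.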
### Phase 5 — Print preflight — DONE on `feature/axa-document-mode` 2026-05-01 ✅
**Pure-Python implementation.** Original plan was Ghostscript-based; built deterministic PyMuPDF checks instead — same approach as Phase 4. Ghostscript can plug in later for total-ink-coverage / registration-black if scope grows.
**Files added:**
- `backend/document_mode/print_preflight_checks.py` — 7 deterministic preflight criteria:
  - **PP1** Page geometry consistency (single MediaBox size across all pages)
  - **PP2** Bleed area defined (TrimBox/BleedBox differ from MediaBox)
  - **PP3** Image colour spaces (flag DeviceRGB; press wants CMYK/Gray)
  - **PP4** Image effective DPI (raw pixels / rendered inches; flag < 150)
  - **PP5** Transparency / soft-mask usage (flag for flattening)
  - **PP6** PDF/X conformance (XMP `pdfxid:GTS_PDFXVersion`)
  - **PP7** Spot colour usage (flag /Separation, /DeviceN)
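The PP4 arithmetic (raw pixels divided by rendered inches) and the PP1 unit conversion can be worked through as follows. PDF user space is 72 points per inch; the function names are illustrative, and taking the lower of the two axes as the effective DPI is an assumption.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4
MIN_PRINT_DPI = 150  # threshold stated for PP4


def effective_dpi(pixel_w: int, pixel_h: int, bbox_w_pt: float, bbox_h_pt: float) -> float:
    """Effective resolution of an image placed in a bbox measured in points."""
    dpi_x = pixel_w / (bbox_w_pt / POINTS_PER_INCH)
    dpi_y = pixel_h / (bbox_h_pt / POINTS_PER_INCH)
    return min(dpi_x, dpi_y)  # the weaker axis decides print quality


def is_low_res(pixel_w: int, pixel_h: int, bbox_w_pt: float, bbox_h_pt: float) -> bool:
    return effective_dpi(pixel_w, pixel_h, bbox_w_pt, bbox_h_pt) < MIN_PRINT_DPI


def points_to_mm(points: float) -> float:
    """Convert a MediaBox dimension from points to millimetres."""
    return points * MM_PER_INCH / POINTS_PER_INCH
```

For example, a 1200 px wide image placed 288 pt (4 inches) wide renders at 300 DPI, while the same image stretched across 720 pt (10 inches) drops to 120 DPI and would be flagged.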
**Files modified:**
- `backend/document_mode/checks.py`: `axa_print_preflight` registered.
- `backend/document_mode/result_writer.py`: `_render_print_preflight` structured renderer with low-DPI image list, colour-space breakdown, spot-colour list, page-size detail.
- `backend/profiles/axa_policy_document.json`: `axa_print_preflight` added at weight 1.0.
**Smoke-test 2026-05-01 against Example 1 Home Insurance V8:**
- Print preflight: 5.71 / 10 (4 of 7 criteria pass), correctly flags as digital-intent
  - PP1: All 86 pages 210×297 mm (A4), consistent
  - PP2: No bleed authored (digital intent; a correct finding for an electronic policy)
  - PP3: Only 1 grayscale image, no RGB
  - PP4: Image renders at 279 DPI (above 150 threshold)
  - PP5: 85 of 86 pages use transparency / soft-masks (Adobe InDesign default; would need flattening for press)
  - PP6: No PDF/X conformance flag in XMP
  - PP7: No spot colour spaces
- Updated full-profile score: 78.25 / Pass (8 checks now)
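The per-check scores reported in the Phase 4 and Phase 5 smoke tests are consistent with a plain pass ratio scaled to 10. The plan does not state the formula, so the helper below is an assumption that happens to reproduce both figures.

```python
def criteria_score(passed: int, total: int) -> float:
    """Score out of 10 as the share of criteria that passed (assumed formula)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(10 * passed / total, 2)


accessibility = criteria_score(7, 9)  # 7 of 9 criteria passed
preflight = criteria_score(4, 7)      # 4 of 7 criteria passed
```

How these per-check scores combine with the profile weights into the overall figure is not covered here.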
**Demo conversation:** *"If you're distributing electronically, this PDF is fine. If you're going to press, you need to (1) author bleed in InDesign, (2) flatten transparency on export, (3) declare PDF/X-1a or PDF/X-4 conformance."*
## All phases status (2026-05-01)
| Phase | Scope | Status | Cost | Wall |
|---|---|---|---|---|
| 1 | Spine + 6 deterministic doc-scope checks | Done | $0 | seconds |
| 2 | Compliance variants (font/phone/bold lists) | Blocked on AXA | | |
| 3 | Old-vs-new diff (vision LLM page-pair) | Done | ~$0.50/run | ~3-5 min |
| 4 | PDF accessibility (PyMuPDF, veraPDF stub) | Done | $0 | seconds |
| 5 | Print preflight (PyMuPDF, Ghostscript later) | Done | $0 | seconds |
**Demo-ready as of 2026-05-01.** All work is on `feature/axa-document-mode`, local-only, no commits or pushes yet.
## Things to flag before any further build
1. **80 pages × multiple LLM checks = serious cost.** A doc with the existing static checks running per-page would be ~$5-10 in Gemini/OpenAI calls. We should decide which LLM checks need to run on every page and which can run once over a sampled set. Most should be deterministic-only.
2. **veraPDF is Java.** Adds a JRE dependency to GCP boxes.
3. **PDF mode breaks "one upload = one report" assumption.** Decide what to save: full per-page JSON, summary only, or both. (Phase 1 saves both.)
4. **Reporting/billing.** An 80-page doc is one analysis but 80× the LLM work. We should bill it as one analysis but track total checks separately. `usage_tracker.log_analysis_complete` already gets `pages_processed` in doc mode.
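A back-of-envelope cost model for the first point, using the only measured figure in this plan: the original Phase 1 run (one LLM check across 86 pages for about $0.50). The count of 12 LLM checks is a hypothetical input, and real per-call cost will vary by model and prompt size.

```python
def cost_per_call(run_cost_usd: float, pages: int, checks: int = 1) -> float:
    """Average cost of one page-level LLM call, derived from a measured run."""
    return run_cost_usd / (pages * checks)


def estimate_run_cost(pages: int, llm_checks: int, per_call_usd: float) -> float:
    """Projected cost when every LLM check runs on every listed page."""
    return pages * llm_checks * per_call_usd


per_call = cost_per_call(0.50, pages=86)          # measured Phase 1 run
every_page = estimate_run_cost(80, 12, per_call)  # 12 checks is hypothetical
sampled = estimate_run_cost(8, 12, per_call)      # N=8 page sample instead
```

Under these assumptions, sampling 8 pages instead of all 80 cuts the LLM spend by a factor of ten, which is the argument for keeping `page_each` rare.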
## How doc-mode plugs into existing pipeline
For maintenance, the integration map:
| Doc-mode component | Reuses existing | Where |
|---|---|---|
| Per-page check execution | `process_checks_in_batches()` | `api_server.py:498` |
| Per-check dispatch | `process_single_check()` | `api_server.py:377` |
| LLM call | `run_visual_qc()` | `llm_config.py` |
| Auth + client access | `auth.require_auth`, `_require_client_access()` | `api_server.py:4883` |
| Progress polling | `/api/progress/<id>` | `api_server.py:1695` |
| Output serving | `/output/<client>/<filename>` | `api_server.py:2121` |
| Output listing | `/api/output_files` | `api_server.py:2168` |
| Output folder | `ensure_client_output_folder()` | `api_server.py:856` |
| Profile loading | `profile_config.get_profile()` | `profile_config.py:219` |
| Profile visibility | `client_config.get_profiles_with_visibility()` | `client_config.py:82` |
| Usage logging | `usage_tracker.log_analysis_start/complete()` | `usage_tracker.py:73,100` |
Future deterministic doc-mode checks should follow `_run_dj_file_naming_check()` (`api_server.py:348`): short-circuit at the top of `process_single_check` before any LLM dispatch.
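The short-circuit pattern can be sketched as a registry lookup that runs before any LLM dispatch. Everything here is illustrative: `process_single_check_sketch`, the `deterministic` decorator and the demo check are hypothetical stand-ins, and the real `process_single_check` signature is not shown in this plan. The result keys mirror the shape listed for `checks.py`.

```python
from typing import Callable, Dict

CheckResult = Dict[str, object]
DETERMINISTIC_CHECKS: Dict[str, Callable[[dict], CheckResult]] = {}


def deterministic(name: str):
    """Register a function as the deterministic handler for a check name."""
    def register(fn):
        DETERMINISTIC_CHECKS[name] = fn
        return fn
    return register


@deterministic("axa_page_count_demo")  # hypothetical check, for illustration only
def page_count_demo(ingest_result: dict) -> CheckResult:
    pages = len(ingest_result["pages"])
    return {"check_name": "axa_page_count_demo", "scope": "document",
            "score": 10.0, "pass": True, "summary": f"{pages} pages",
            "findings": {"pages": pages}, "response": ""}


def process_single_check_sketch(check_name: str, ingest_result: dict,
                                run_llm: Callable[[str, dict], CheckResult]) -> CheckResult:
    handler = DETERMINISTIC_CHECKS.get(check_name)
    if handler is not None:
        return handler(ingest_result)  # short-circuit: no LLM call is made
    return run_llm(check_name, ingest_result)
```

Keeping the lookup at the very top means a registered check can never incur LLM cost by accident.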