nickviljoen 90563b8cf2 Add AXA document-mode QC pipeline (Phases 1, 3, 4, 5)

Multi-page PDF QC for AXA Ireland policy documents. Runs as a third mode
alongside static + video, gated on profile.mode. New code isolated under
backend/document_mode/ with new endpoints under /api/document/*.

Phase 1 — Spine + 6 deterministic doc-scope checks ($0, runs in seconds):
- Scope-aware dispatcher (document/targeted/page_sample/page_pair/page_each)
- axa_font_inventory, axa_phone_inventory, axa_bold_words_definitions,
  axa_page_numbering, axa_print_code, axa_omg_versioning
- Bootstrap bold-words dictionary extracted from Example 1 General Definitions

Phase 3 — Old-vs-new diff (~$0.50/run, 3-5 min):
- Page alignment via difflib SequenceMatcher (windowed fuzzy match)
- Vision-LLM page-pair diff via Gemini 2.5 Pro (8 concurrent)
- Two-slot upload UX, axa_policy_document_diff profile, mode=document_diff

Phase 4 — PDF accessibility (PyMuPDF, $0):
- 9 PDF/UA-1 aligned criteria (tagged structure, /MarkInfo, title, /Lang,
  encryption, font embedding, PDF version, XMP UA-conformance, alt-text)
- _run_verapdf() stub for optional Java-based veraPDF integration later

Phase 5 — Print preflight (PyMuPDF, $0):
- 7 criteria (page geometry, bleed, image colour spaces, image DPI,
  transparency, PDF/X conformance, spot colours)

Profile additions:
- axa_policy_document — 8 deterministic checks, $0 cost
- axa_policy_document_diff — 1 page-pair LLM check, ~$0.50/run

API additions:
- POST /api/document/start_analysis (single PDF)
- POST /api/document/start_diff (old + new PDFs)

Frontend additions:
- Third profile.mode value (document_diff) in applyProfileMode()
- Two-slot upload UX with PDF-only file pickers
- checkFormValidity() branches by mode for the analyse-button gate

Smoke-tested locally against Example 1 (Home Insurance V8, 86pp) and
Example 2 (Landlord V1 vs V10, 68→74pp) with real findings caught
including bold-words gaps, missing PDF/UA flag, transparency on press,
V1→V10 bold-formatting fixes. Plan + integration map + gotchas in
backend/AXA_DOCUMENT_MODE_PLAN.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 18:38:14 +02:00

AXA Document-Mode QC — Build Plan

Multi-page PDF QC pipeline for AXA Ireland. AXA differs from every other onboarded client because the QC target is an 80-page policy PDF, not a single image or video. Sources of requirements:

axa_ireland/AXA build guide new tools required.txt — original ask + first scoping
axa_ireland/Exampole Folders explained.rtf — what the example folders represent
axa_ireland/Example 1/ — old vs new (post brand-refresh) Home Insurance policy + the human QC checklist (Policy documents QC CHECKS.docx)
axa_ireland/Example 2/ — Landlord Insurance v1 (shipped with errors), V10 (corrected), V1 2025 amends, original master. Phase-3 diff test pair: V1 vs V10.

Sequencing (updated 2026-05-01)

Inverted from "discover everything → build" to "build minimum demo → show client → drive deeper discovery from their reactions".

Phase 1 refactor ✅ DONE on feature/axa-document-mode 2026-05-01 — scope-aware dispatcher, 6 deterministic checks, $0 LLM cost, runs in seconds. Replaces broken Phase-1 stub.
Phase 3 build (next) — old-vs-new diff using vision-LLM page-pair, no LlamaParse account needed.
Show-and-tell with AXA — demonstrate Phases 1 + 3 working, show costs, gather requirements (font list, bold-words list, phone numbers, WCAG target, print preflight scope).
Phases 2, 4, 5 in order, with local testing between each.
Local-only until show-and-tell; do not push feature/axa-document-mode to dev or merge to develop yet.

Phase 0 discovery — answered 2026-05-01 (provisional, pending client confirmation)

Question	Decision
Approved Monotype font list	Not yet supplied. Until then, `axa_font_inventory` lists fonts only (informational). When list arrives, becomes `axa_font_compliance`.
Bold-words dictionary	Bootstrap from Example 1 General Definitions (pages 8–10) until AXA supplies canonical list. 35 terms extracted, saved to `backend/document_mode/data/axa_bold_words_seed.json`. Some short terms (`you`, `your`, `we`) produce false positives — accept until canonical list lands.
Approved phone number list	Not yet supplied. `axa_phone_inventory` lists numbers only. Becomes `axa_phone_compliance` when list arrives.
WCAG target	AAA (Phase 4 veraPDF profile setting)
Print preflight scope	"Is it print-ready?" simple version. Expand later if needed.
Page sampling defaults	N=8 for visual sanity, N=5 for print sanity
Volume expectation	Still pending — drives later cost decisions

Architectural choice

One web UI, isolated backend module. Doc-mode shares the existing shell (auth, client picker, Settings, Reporting, Admin, user-access, output history) and runs as a third mode alongside static + video. New code lives in backend/document_mode/ with new endpoints under /api/document/*, gated by mode: "document" on the profile JSON. Existing single-asset clients are not touched.

Rejected: a separate /ai_qc/documents/ page. Would re-implement the shell for one client and fork future doc-mode work per-client.

Scope-aware dispatcher (Phase 1 refactor, 2026-05-01)

Each check declares its scope in the profile JSON. The dispatcher routes by scope:

Scope	Behaviour	Cost shape	Phase 1 uses?
`document`	Run once over the full ingest result. Deterministic checks live in `document_mode/checks.py`.	$0, milliseconds	✅ all 4 deterministic doc-checks
`targeted`	Run once on specific pages (`scope_args.pages`: `first`, `last`, `first-N`, `last-N`, or list).	$0, milliseconds	✅ print_code, omg_versioning
`page_sample`	Run on N evenly-spaced pages via existing batch dispatcher.	N × LLM call	Phase 4/5
`page_pair`	Run on aligned old/new page pairs.	M × LLM call	Phase 3
`page_each`	Run on every page (legacy / very expensive).	N × LLM call per check	Avoid
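A minimal sketch of how the `targeted` and `page_sample` scopes might resolve to concrete pages. The function names are hypothetical, and reading `first-N` / `last-N` as "the first or last N pages" is an assumption about the intended semantics, as is the choice to always include the first and last page when sampling.

```python
from typing import List, Sequence, Union

PagesSpec = Union[str, Sequence[int]]


def resolve_target_pages(spec: PagesSpec, page_count: int) -> List[int]:
    """Resolve a scope_args.pages value to 1-based page numbers."""
    if isinstance(spec, str):
        if spec == "first":
            return [1] if page_count else []
        if spec == "last":
            return [page_count] if page_count else []
        if spec.startswith("first-"):
            n = int(spec.split("-", 1)[1])
            return list(range(1, min(n, page_count) + 1))
        if spec.startswith("last-"):
            n = int(spec.split("-", 1)[1])
            return list(range(max(1, page_count - n + 1), page_count + 1))
        raise ValueError(f"unknown pages spec: {spec!r}")
    # Explicit list: keep in-range pages, de-duplicated and sorted.
    return sorted({p for p in spec if 1 <= p <= page_count})


def sample_pages(page_count: int, n: int) -> List[int]:
    """Pick up to n evenly spaced 1-based pages, including first and last."""
    if page_count <= 0 or n <= 0:
        return []
    if n >= page_count:
        return list(range(1, page_count + 1))
    if n == 1:
        return [1]
    step = (page_count - 1) / (n - 1)
    return sorted({1 + round(i * step) for i in range(n)})
```

With the Phase 0 default of N=8, `sample_pages(86, 8)` would spread eight pages across the 86-page Home Insurance PDF, and `resolve_target_pages("last", 86)` would select the back page used by the print-code and OMG checks.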

Profile JSON shape:

"checks": {
  "axa_font_inventory": {"weight": 1.0, "scope": "document", "enabled": true},
  "axa_print_code": {"weight": 1.0, "scope": "targeted", "scope_args": {"pages": "last"}, "enabled": true}
}

The scope field is optional in the QCCheckConfig dataclass — defaults to None, which the doc-mode dispatcher treats as page_each for legacy compatibility. Asset-mode pipeline ignores scope entirely.
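The routing rule can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `dispatcher.py`: the registry name `DETERMINISTIC_CHECKS`, the `dispatch_check` signature, and the `run_llm_check` callback are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: check name -> deterministic function taking
# (ingest_result, scope_args) and returning a result dict.
CheckFn = Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]
DETERMINISTIC_CHECKS: Dict[str, CheckFn] = {}

DETERMINISTIC_SCOPES = {"document", "targeted"}
LLM_SCOPES = {"page_sample", "page_pair", "page_each"}


def effective_scope(check_cfg: Dict[str, Any]) -> str:
    """A missing or None scope falls back to page_each (legacy behaviour)."""
    return check_cfg.get("scope") or "page_each"


def dispatch_check(
    name: str,
    check_cfg: Dict[str, Any],
    ingest_result: Dict[str, Any],
    run_llm_check: Callable[[str, str, Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Route one check by scope; deterministic scopes never reach the LLM."""
    scope = effective_scope(check_cfg)
    if scope in DETERMINISTIC_SCOPES:
        check_fn = DETERMINISTIC_CHECKS[name]
        return check_fn(ingest_result, check_cfg.get("scope_args") or {})
    if scope in LLM_SCOPES:
        return run_llm_check(name, scope, ingest_result)
    raise ValueError(f"unknown scope {scope!r} for check {name!r}")
```

Raising on an unrecognised scope, rather than silently falling back, is a design choice of this sketch: a typo in a profile then fails loudly instead of triggering an expensive per-page run.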

Phased delivery

Each phase ships as its own tag so we can demo / rollback in slices.

Phase 0 — Discovery (you, before further code)

Need from AXA before building Phase 2+:

Approved Monotype font list (family + weights). Brand refresh moved them off old BOX-licensed fonts; we need the canonical list for axa_font_compliance.
Bold-words dictionary for General Definitions (Example 2 says 70+).
2–3 more old/new PDF pairs beyond Examples 1 & 2 (ideally Motor Insurance for diversity).
WCAG target — AA or AAA. EAA scope confirmation.
Print preflight scope — "is it print-ready?" or full PDF/X-1a/4 compliance.
Volume expectation — 80 pages × how many docs/month.

Phase 1 — Document-mode plumbing + deterministic checks — REFACTORED 2026-05-01 ✅

Original Phase 1 (2026-04-29): spine only, ran existing image-based accessibility per page on all 86 pages. Smoke test ran in ~70min for ~$0.50. Output report revealed every "failing page" failed for the same false-positive reason: "document is presented as an image of text / WCAG 1.4.5" — the LLM was critiquing our PNG rendering pipeline, not AXA's actual PDF. Result: noisy 67.7/Pass with no actionable findings.

Refactor (2026-05-01) on feature/axa-document-mode: scope-aware dispatcher + 6 deterministic doc-scope checks. Same Home Insurance PDF now scores 81.4/Pass in seconds at $0 cost, with real findings: font inventory, phone-number inventory, 132 non-bold defined-term occurrences flagged across 53 pages, 5 page-numbering discontinuities, print code "AG400 11/25" detected.

Files added:

backend/document_mode/__init__.py
backend/document_mode/ingest.py — multi-page PDF → per-page PNGs + per-span structured text (font, size, bold flag, italic flag, bbox). Uses PyMuPDF. Bold detection = flags & 16 OR font name contains bold|black|heavy. Default zoom 2.0×, max dim 1600 px, page cap 200.
backend/document_mode/dispatcher.py — scope-aware routing. document and targeted checks bypass LLM; page_sample / page_each use existing batch dispatcher; page_pair reserved for Phase 3.
backend/document_mode/checks.py — registry of 6 deterministic doc-scope checks (see table below). Each returns {check_name, scope, score, pass, summary, findings, response}.
backend/document_mode/data/axa_bold_words_seed.json — bootstrap dictionary, 35 terms extracted from Example 1 General Definitions (pages 8-10).
backend/document_mode/result_writer.py — writes JSON + self-contained HTML with: at-a-glance findings table, per-check sections with structured renderers (font/phone/bold-words/page-numbering/print-code/OMG each get their own table), per-page summary strip. Reports collapsed by default.
backend/profiles/axa_policy_document.json — production profile with mode: "document", 6 deterministic checks, visibility: client_specific, visible_to_clients: ["axa"].
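The bold-detection rule in `ingest.py` can be illustrated with a small helper over PyMuPDF-style span dicts (which carry `flags` and `font` keys). The helper name and the case-insensitive font-name match are assumptions of this sketch.

```python
import re

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" value marks bold
BOLD_NAME_RE = re.compile(r"bold|black|heavy", re.IGNORECASE)


def is_bold_span(span: dict) -> bool:
    """True if the span's flag bit is set or its font name suggests a bold weight."""
    flags = int(span.get("flags", 0))
    font = str(span.get("font", ""))
    return bool(flags & BOLD_FLAG) or bool(BOLD_NAME_RE.search(font))
```

The font-name fallback matters because some PDFs use a heavy font face without setting the bold flag bit on the span.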

The 6 Phase-1 deterministic checks

Check	Scope	What it does	Becomes (when client supplies data)
`axa_font_inventory`	document	Lists every unique font + per-page distribution	`axa_font_compliance` (flags non-approved)
`axa_phone_inventory`	document	Regex-extracts every phone-shaped number, dedup, with page refs	`axa_phone_compliance` (flags non-approved)
`axa_bold_words_definitions`	document	Scans for seed-dictionary terms, flags non-bold occurrences	Same — replace seed dict with AXA's canonical list
`axa_page_numbering`	document	Detects standalone-line integers near top/bottom, flags discontinuities	Same
`axa_print_code`	targeted: last	Finds back-page print/version line components (code + ref + date + version)	Same — refine regex once AXA confirms format
`axa_omg_versioning`	targeted: last	Finds OMG code + date format on back page	Same
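As an illustration of the `axa_bold_words_definitions` idea, the sketch below scans non-bold spans for dictionary terms. The span shape (`page`, `text`, `bold`) and the function name are hypothetical; the real ingest result may be structured differently.

```python
import re
from typing import Dict, Iterable, List


def find_non_bold_terms(spans: Iterable[dict], terms: Iterable[str]) -> List[Dict]:
    """Return one finding per dictionary term that appears in a non-bold span."""
    # Longest terms first so multi-word terms are tried before shorter ones.
    ordered = sorted({t.lower() for t in terms}, key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    findings = []
    for span in spans:
        if span["bold"]:
            continue  # bold occurrences are compliant
        for match in pattern.finditer(span["text"]):
            findings.append({
                "page": span["page"],
                "term": match.group(1).lower(),
                "context": span["text"],
            })
    return findings
```

Whole-word matching keeps `we` from matching inside `were`, but it cannot tell a defined term from an ordinary pronoun, which is why short seed terms still produce false positives.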

Files modified:

backend/profile_config.py — Profile.mode field defaults to "asset". QCCheckConfig gains scope and scope_args fields, both optional. Persisted only when non-default.
backend/api_server.py — POST /api/document/start_analysis endpoint. The enabled_checks filter accepts checks from the document-mode registry (is_document_scope_check) in addition to the legacy qc_apps registry, so deterministic AXA checks aren't filtered out.
backend/client_config.py — AXA client gains axa_policy_document as first profile.
web_ui.html — doc-mode banner under upload area, file-input accept swapped to PDF-only, performAnalysisWithProgress routes to /api/document/start_analysis with client_id.
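The "persisted only when non-default" behaviour of the new `QCCheckConfig` fields could look something like the sketch below. Only `scope` and `scope_args` come from the plan; the other fields mirror the profile JSON keys, and the `to_dict` method name is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class QCCheckConfig:
    weight: float = 1.0
    enabled: bool = True
    scope: Optional[str] = None                  # None means legacy behaviour
    scope_args: Optional[Dict[str, Any]] = None  # e.g. {"pages": "last"}

    def to_dict(self) -> Dict[str, Any]:
        """Serialise for the profile JSON, omitting doc-mode fields when unset."""
        data: Dict[str, Any] = {"weight": self.weight, "enabled": self.enabled}
        if self.scope is not None:
            data["scope"] = self.scope
        if self.scope_args:
            data["scope_args"] = self.scope_args
        return data
```

Omitting unset fields keeps existing asset-mode profile files byte-for-byte unchanged when they are re-saved.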

Smoke-tested 2026-05-01 (post-refactor) against same Home Insurance PDF:

Score: 81.4 / 100 (Pass)
Total runtime: a few seconds (deterministic only, no LLM calls)
Total cost: $0
Findings: 10 fonts, 8 phone numbers, 132 non-bold defined-term occurrences across 53 pages, 5 page-number discontinuities, print code "AG400" + "11/25" detected on back page, no OMG present.
Smoke-test report: backend/output-dev/axa/PHASE1_REFACTOR_smoke_test_report.html

Local test plan (after ./scripts/run-local.sh):

Pick AXA client → AXA Policy Document profile
Upload axa_ireland/Example 1/6317047 - AXA - Home Insurance Policy 2025 V8 final new brand.pdf
Verify: doc-mode banner, PDF-only picker, progress completes in seconds, report appears with at-a-glance table + per-check sections + structured findings.

Known gotchas to surface during demo:

Bold-words bootstrap dictionary contains short terms (you, your, we, us) which produce false positives in normal pronoun usage. To be mitigated in Phase 2 by the canonical list from AXA.
Page-numbering heuristic catches TOC-page numbers as false-positive "page numbers" (5 such hits in this doc). Surface as data, score gently.
Print-code regex tuned to "AG400 11/25" pattern observed in Example 1; may need tuning for other docs.
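To make the page-numbering gotcha concrete, here is one simple way a discontinuity check could work. It is a sketch under assumptions: the input shape (physical page index mapped to the detected printed number, or `None`) and the function name are hypothetical, and the real heuristic may differ.

```python
from typing import Dict, List, Optional


def find_discontinuities(detected: Dict[int, Optional[int]]) -> List[dict]:
    """Flag pages whose printed number does not follow from the previous one.

    `detected` maps 1-based physical page index to the printed page number
    found on that page, or None when no number was detected.
    """
    findings = []
    prev_idx = prev_num = None
    for idx in sorted(detected):
        num = detected[idx]
        if num is None:
            continue  # covers and blank pages carry no printed number
        if prev_idx is not None:
            expected = prev_num + (idx - prev_idx)
            if num != expected:
                findings.append({"page": idx, "expected": expected, "found": num})
        prev_idx, prev_num = idx, num
    return findings
```

In this sketch a single stray integer (for example a TOC entry picked up as a page number) yields two findings, one where the sequence breaks and one where it resumes, so raw counts overstate the number of real problems.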

Phase 2 — Deterministic checks (~3–5 days)

The cheap, accurate wins. No LLM cost.

backend/document_mode/font_compliance.py — reads PDF font inventory, flags anything not on AXA's approved list. Per-page failure log. Plugs into process_single_check via the same early-branch pattern as dj_file_naming (line ~384 in api_server.py).
backend/document_mode/bold_words.py — scans pages for AXA's bold-words dictionary, flags any occurrence not rendered bold.
backend/document_mode/print_code.py — extracts back-page print code, optionally compares to brief-supplied value.
backend/document_mode/omg_versioning.py — confirms back-page OMG number + date format compliance (regex pattern, similar to file_naming_validator).
New Settings → "AXA Configuration" tab for uploading approved fonts list + bold-words list per client (same pattern as media plan upload).

Phase 3 — Old-vs-new diff — DONE on `feature/axa-document-mode` 2026-05-01 ✅

Vision-LLM-based page-pair diff. Validates the original Example-2 promise: catches the bold-formatting fixes, structural changes, definition updates, and content additions/removals that V1 missed and V10 fixed.

Files added:

backend/document_mode/diff_engine.py — page alignment via difflib SequenceMatcher (windowed fuzzy match, threshold 0.4) + parallel page-pair vision-LLM diff via Gemini 2.5 Pro (8 concurrent). Returns alignment map + structured diff JSON per pair (added/removed/modified/moved/style_changes/severity).
backend/document_mode/diff_report_writer.py — diff-specific HTML/JSON. Versions card, at-a-glance grid (page count delta, severity counts), full alignment table, per-pair cards with severity pills + categorised diff blocks.
backend/profiles/axa_policy_document_diff.json — mode: document_diff profile.
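The page-alignment step in `diff_engine.py` can be approximated as below. This is a simplified sketch, not the shipped algorithm: it assumes pages keep their relative order, takes the window size as a free parameter (the plan only states the 0.4 threshold), and uses a hypothetical function name.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def align_pages(
    old: List[str],
    new: List[str],
    window: int = 5,
    threshold: float = 0.4,
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Greedy windowed alignment of page texts.

    Returns (matched index pairs, removed old indices, added new indices),
    all 0-based.
    """
    pairs: List[Tuple[int, int]] = []
    removed: List[int] = []
    cursor = 0  # first new-page index still available for matching
    for i, old_text in enumerate(old):
        best_j, best_ratio = None, 0.0
        for j in range(cursor, min(len(new), cursor + window)):
            # autojunk=False avoids the popularity heuristic on long texts.
            ratio = SequenceMatcher(None, old_text, new[j], autojunk=False).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is not None and best_ratio >= threshold:
            pairs.append((i, best_j))
            cursor = best_j + 1
        else:
            removed.append(i)
    matched_new = {j for _, j in pairs}
    added = [j for j in range(len(new)) if j not in matched_new]
    return pairs, removed, added
```

Only the matched pairs would then go to the vision-LLM diff, while added and removed pages can be reported directly from the alignment map without any LLM call.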

Files modified:

backend/api_server.py — new POST /api/document/start_diff endpoint accepting old_file + new_file. Reuses _require_client_access, progress_tracker, ensure_client_output_folder, usage_tracker.
backend/client_config.py — AXA profile list gains axa_policy_document_diff.
web_ui.html — third mode: document_diff UX path. Two-slot drop-zone (old + new). applyProfileMode() swaps between asset/document/document_diff. wireDiffPickers() wires the dual file pickers. startAnalysis() + performAnalysisWithProgress() route diff-mode submissions to /api/document/start_diff.

Smoke-test 2026-05-01 — V1 (68 pages, broken) vs V10 (74 pages, corrected):

Wall: 214 seconds (3:34)
Tokens: 214,342 (cost ≈ $0.50–$0.70)
63 matched pairs · 11 pages added in V10 · 5 pages removed
61 pages with differences flagged · 2 unchanged
Severity: 25 high, 32 medium, 4 low
Score 0/100, "Major changes" — correct call
Smoke-test report: backend/output-dev/axa/phase3_smoke_*_diff_report.html

Caught the Example-2-class defects:

Bold-formatting changes (e.g. "the terms 'us', 'we', and 'adviser' are now bolded", "the term 'your' is now bolded") — exactly the missed-bold issue that motivated this build.
New Section F: Legal Protection added in V10 — structural insertion caught.
New "Period of Insurance" definition added — defined-term addition caught.
"Employee" definition expanded by a sub-point — definition modification caught.
Wording fix: "supply your own expense" → "supply at your own expense" — body-text correction caught.
11 pages flagged as added, 5 as removed — page-count delta and structural restructure caught.

Cost dial (for show-and-tell): ~$0.40–0.70 per diff against a typical 70-80-page policy.

Local test plan (UI):

./scripts/run-local.sh
AXA → AXA Policy Document — Old vs New Diff
Pick V1.pdf as old, V10.pdf as new (both from axa_ireland/Example 2/)
Click analyse. Wait ~3-5 minutes. Report lands in saved files.

Phase 4 — PDF accessibility — DONE on `feature/axa-document-mode` 2026-05-01 ✅

Pure-Python implementation. Original plan was veraPDF subprocess (Java dependency, ~150MB install). Built deterministic PyMuPDF-based check instead — no Java needed for the demo, with veraPDF as an optional add-on later.

Files added:

backend/document_mode/accessibility_checks.py — 9 PDF/UA-aligned criteria checked deterministically:
- C1 Tagged PDF (StructTreeRoot present)
- C2 Marked content (/MarkInfo /Marked true)
- C3 Document title metadata
- C4 Document language (/Lang)
- C5 No password protection blocking AT
- C6 All fonts embedded
- C7 PDF version ≥ 1.5
- C8 XMP UA-conformance declaration
- C9 Alt text on images (sampling)
- Plus a _run_verapdf() stub for future veraPDF integration

Files modified:

backend/document_mode/checks.py — axa_pdf_accessibility registered.
backend/document_mode/ingest.py — pdf_path added to ingest_result so doc-scope checks can read raw PDF structure.
backend/document_mode/result_writer.py — _render_pdf_accessibility structured renderer (criteria checklist with pass/fail markers).
backend/profiles/axa_policy_document.json — axa_pdf_accessibility added at weight 2.0.

Smoke-test 2026-05-01 against Example 1 Home Insurance V8 (Adobe InDesign output):

Overall AXA Policy Document score: 80.6 / Pass (7 checks, $0 cost, runs in seconds)
Accessibility check: 7.78 / 10 (7 of 9 criteria passed)
Real gaps caught:
- C7 fail: PDF 1.4 — should be 1.5+ for full accessibility tagging support
- C8 fail: No PDF/UA-1 conformance flag in XMP metadata
Pass: tagged structure, marked content, title set, /Lang=en, no encryption, all 10 fonts embedded, alt-text entries detected
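The reported figures are consistent with a simple pass-ratio score scaled to 10. The sketch below assumes that formula, since the plan does not spell it out, and the function name is hypothetical.

```python
from typing import Dict


def criteria_score(results: Dict[str, bool]) -> float:
    """Score out of 10 as the fraction of criteria passed, rounded to 2 dp."""
    if not results:
        return 0.0
    passed = sum(1 for ok in results.values() if ok)
    return round(10 * passed / len(results), 2)


# Accessibility smoke test: C7 and C8 fail, the other seven criteria pass.
accessibility = {f"C{i}": i not in (7, 8) for i in range(1, 10)}
print(criteria_score(accessibility))  # 7.78
```

The same formula reproduces the print-preflight figure: 4 of 7 criteria passing gives 5.71.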

veraPDF integration plan (when ready):

Install veraPDF on host: https://verapdf.org/software/ (requires JRE 8+, ~150MB)
Ensure verapdf binary on PATH or set VERAPDF_BIN env var
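Replace _run_verapdf() stub with subprocess.run([verapdf, '--format', 'json', '--flavour', 'ua1', pdf_path], capture_output=True) and merge JSON findings into axa_pdf_accessibility's output. (The veraPDF CLI selects a built-in validation profile with --flavour; --profile expects a path to a profile XML file.)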
Set findings['verapdf_run'] = True
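A hedged sketch of what the filled-in stub might look like. The return shape, timeout, and error handling are assumptions, and the flag spelling (`--flavour ua1` with `--format json`) should be confirmed against `verapdf --help` for the installed version before relying on it.

```python
import json
import os
import shutil
import subprocess
from typing import Any, Dict


def run_verapdf(pdf_path: str, timeout_s: int = 300) -> Dict[str, Any]:
    """Run veraPDF if available; never raise, so the check degrades gracefully."""
    binary = os.environ.get("VERAPDF_BIN") or shutil.which("verapdf")
    if not binary:
        return {"verapdf_run": False, "reason": "verapdf binary not found"}
    cmd = [binary, "--format", "json", "--flavour", "ua1", pdf_path]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return {"verapdf_run": False, "reason": str(exc)}
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {"verapdf_run": False, "reason": "unparseable veraPDF output"}
    # The exit code is kept because validators commonly signal
    # non-compliance via a non-zero status rather than an error.
    return {"verapdf_run": True, "returncode": proc.returncode, "report": report}
```

Returning `verapdf_run: False` with a reason, instead of raising, lets the PyMuPDF-based criteria remain the source of the score on hosts without a JRE.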

Phase 5 — Print preflight — DONE on `feature/axa-document-mode` 2026-05-01 ✅

Pure-Python implementation. Original plan was Ghostscript-based; built deterministic PyMuPDF checks instead — same approach as Phase 4. Ghostscript can plug in later for total-ink-coverage / registration-black if scope grows.

Files added:

backend/document_mode/print_preflight_checks.py — 7 deterministic preflight criteria:
- PP1 Page geometry consistency (single MediaBox size across all pages)
- PP2 Bleed area defined (TrimBox/BleedBox differ from MediaBox)
- PP3 Image colour spaces (flag DeviceRGB; press wants CMYK/Gray)
- PP4 Image effective DPI (raw pixels / rendered inches; flag < 150)
- PP5 Transparency / soft-mask usage (flag for flattening)
- PP6 PDF/X conformance (XMP pdfxid:GTS_PDFXVersion)
- PP7 Spot colour usage (flag /Separation, /DeviceN)
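The PP4 calculation (raw pixels divided by rendered inches) can be sketched as follows. PDF user space has 72 points per inch; the function names and the choice to report the lower of the two axes are assumptions of this example.

```python
POINTS_PER_INCH = 72.0


def effective_dpi(pixel_w: int, pixel_h: int, placed_w_pt: float, placed_h_pt: float) -> float:
    """Effective resolution of an image at its placed size on the page.

    pixel_w/pixel_h are the raw image dimensions; placed_w_pt/placed_h_pt
    are the width and height of the image's rectangle on the page, in points.
    """
    if placed_w_pt <= 0 or placed_h_pt <= 0:
        raise ValueError("placed size must be positive")
    dpi_x = pixel_w / (placed_w_pt / POINTS_PER_INCH)
    dpi_y = pixel_h / (placed_h_pt / POINTS_PER_INCH)
    # Report the weaker axis so non-uniform scaling cannot hide a problem.
    return min(dpi_x, dpi_y)


def is_low_resolution(dpi: float, threshold: float = 150.0) -> bool:
    """Apply the PP4 rule: flag images below the threshold."""
    return dpi < threshold
```

For instance, a 1500×1000 px image placed at 360×240 pt (5 in × 3.33 in) works out to about 300 DPI, while the same placement of a 600×400 px image gives about 120 DPI and would be flagged.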

Files modified:

backend/document_mode/checks.py — axa_print_preflight registered.
backend/document_mode/result_writer.py — _render_print_preflight structured renderer with low-DPI image list, colour-space breakdown, spot-colour list, page-size detail.
backend/profiles/axa_policy_document.json — axa_print_preflight added at weight 1.0.

Smoke-test 2026-05-01 against Example 1 Home Insurance V8:

Print preflight: 5.71 / 10 (4 of 7 criteria pass) — correctly flags as digital-intent
- ✓ PP1 — All 86 pages 210×297 mm (A4), consistent
- ✗ PP2 — No bleed authored (digital intent — correct finding for an electronic policy)
- ✓ PP3 — Only 1 grayscale image, no RGB
- ✓ PP4 — Image renders at 279 DPI (above 150 threshold)
- ✗ PP5 — 85 of 86 pages use transparency / soft-masks (Adobe InDesign default; would need flattening for press)
- ✗ PP6 — No PDF/X conformance flag in XMP
- ✓ PP7 — No spot colour spaces
Updated full-profile score: 78.25 / Pass (8 checks now)

Demo conversation: "If you're distributing electronically, this PDF is fine. If you're going to press, you need to (1) author bleed in InDesign, (2) flatten transparency on export, (3) declare PDF/X-1a or PDF/X-4 conformance."

All phases status (2026-05-01)

Phase	Scope	Status	Cost	Wall
1	Spine + 6 deterministic doc-scope checks	✅ Done	$0	seconds
2	Compliance variants (font/phone/bold lists)	Blocked on AXA	—	—
3	Old-vs-new diff (vision LLM page-pair)	✅ Done	~$0.50/run	~3-5 min
4	PDF accessibility (PyMuPDF, veraPDF stub)	✅ Done	$0	seconds
5	Print preflight (PyMuPDF, Ghostscript later)	✅ Done	$0	seconds

Demo-ready as of 2026-05-01. All work is on feature/axa-document-mode, local-only, no commits or pushes yet.

Things to flag before any further build

80 pages × multiple LLM checks = serious cost. A doc with the existing static checks running per-page would be ~$5–10 in Gemini/OpenAI calls. We should decide which LLM checks need per-page vs once over a sampled set. Most should be deterministic-only.
veraPDF is Java. Adds a JRE dependency to GCP boxes.
PDF mode breaks "one upload = one report" assumption. Decide what to save: full per-page JSON, summary only, or both. (Phase 1 saves both.)
Reporting/billing. An 80-page doc is one analysis but 80× the LLM work. We should bill it as one analysis but track total checks separately. usage_tracker.log_analysis_complete already gets pages_processed in doc mode.

How doc-mode plugs into existing pipeline

For maintenance — the integration map:

Doc-mode component	Reuses existing	Where
Per-page check execution	`process_checks_in_batches()`	`api_server.py:498`
Per-check dispatch	`process_single_check()`	`api_server.py:377`
LLM call	`run_visual_qc()`	`llm_config.py`
Auth + client access	`auth.require_auth`, `_require_client_access()`	`api_server.py:4883`
Progress polling	`/api/progress/<id>`	`api_server.py:1695`
Output serving	`/output/<client>/<filename>`	`api_server.py:2121`
Output listing	`/api/output_files`	`api_server.py:2168`
Output folder	`ensure_client_output_folder()`	`api_server.py:856`
Profile loading	`profile_config.get_profile()`	`profile_config.py:219`
Profile visibility	`client_config.get_profiles_with_visibility()`	`client_config.py:82`
Usage logging	`usage_tracker.log_analysis_start/complete()`	`usage_tracker.py:73,100`

Future deterministic doc-mode checks should follow _run_dj_file_naming_check() (api_server.py:348) — short-circuit at the top of process_single_check before any LLM dispatch.
