Multi-page PDF QC for AXA Ireland policy documents. Runs as a third mode alongside static + video, gated on profile.mode. New code isolated under backend/document_mode/ with new endpoints under /api/document/*.

Phase 1 — Spine + 6 deterministic doc-scope checks ($0, runs in seconds):
- Scope-aware dispatcher (document/targeted/page_sample/page_pair/page_each)
- axa_font_inventory, axa_phone_inventory, axa_bold_words_definitions, axa_page_numbering, axa_print_code, axa_omg_versioning
- Bootstrap bold-words dictionary extracted from Example 1 General Definitions

Phase 3 — Old-vs-new diff (~$0.50/run, 3-5 min):
- Page alignment via difflib SequenceMatcher (windowed fuzzy match)
- Vision-LLM page-pair diff via Gemini 2.5 Pro (8 concurrent)
- Two-slot upload UX, axa_policy_document_diff profile, mode=document_diff

Phase 4 — PDF accessibility (PyMuPDF, $0):
- 9 PDF/UA-1 aligned criteria (tagged structure, /MarkInfo, title, /Lang, encryption, font embedding, PDF version, XMP UA-conformance, alt-text)
- _run_verapdf() stub for optional Java-based veraPDF integration later

Phase 5 — Print preflight (PyMuPDF, $0):
- 7 criteria (page geometry, bleed, image colour spaces, image DPI, transparency, PDF/X conformance, spot colours)

Profile additions:
- axa_policy_document — 8 deterministic checks, $0 cost
- axa_policy_document_diff — 1 page-pair LLM check, ~$0.50/run

API additions:
- POST /api/document/start_analysis (single PDF)
- POST /api/document/start_diff (old + new PDFs)

Frontend additions:
- Third profile.mode value (document_diff) in applyProfileMode()
- Two-slot upload UX with PDF-only file pickers
- checkFormValidity() branches by mode for the analyse-button gate

Smoke-tested locally against Example 1 (Home Insurance V8, 86pp) and Example 2 (Landlord V1 vs V10, 68→74pp) with real findings caught including bold-words gaps, missing PDF/UA flag, transparency on press, V1→V10 bold-formatting fixes.

Plan + integration map + gotchas in backend/AXA_DOCUMENT_MODE_PLAN.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AXA Document-Mode QC — Build Plan
Multi-page PDF QC pipeline for AXA Ireland. Different from every other client onboard because the QC target is an 80-page policy PDF, not a single image/video. Sources of requirements:
- axa_ireland/AXA build guide new tools required.txt — original ask + first scoping
- axa_ireland/Exampole Folders explained.rtf — what the example folders represent
- axa_ireland/Example 1/ — old vs new (post brand-refresh) Home Insurance policy + the human QC checklist (Policy documents QC CHECKS.docx)
- axa_ireland/Example 2/ — Landlord Insurance v1 (shipped with errors), V10 (corrected), V1 2025 amends, original master. Phase-3 diff test pair: V1 vs V10.
Sequencing (updated 2026-05-01)
Inverted from "discover everything → build" to "build minimum demo → show client → drive deeper discovery from their reactions".
- Phase 1 refactor ✅ DONE on feature/axa-document-mode 2026-05-01 — scope-aware dispatcher, 6 deterministic checks, $0 LLM cost, runs in seconds. Replaces broken Phase-1 stub.
- Phase 3 build (next) — old-vs-new diff using vision-LLM page-pair, no LlamaParse account needed.
- Show-and-tell with AXA — demonstrate Phases 1 + 3 working, show costs, gather requirements (font list, bold-words list, phone numbers, WCAG target, print preflight scope).
- Phases 2, 4, 5 in order, with local testing between each.
- Local-only until show-and-tell; do not push feature/axa-document-mode to dev or merge to develop yet.
Phase 0 discovery — answered 2026-05-01 (provisional, pending client confirmation)
| Question | Decision |
|---|---|
| Approved Monotype font list | Not yet supplied. Until then, axa_font_inventory lists fonts only (informational). When list arrives, becomes axa_font_compliance. |
| Bold-words dictionary | Bootstrap from Example 1 General Definitions (pages 8–10) until AXA supplies canonical list. 35 terms extracted, saved to backend/document_mode/data/axa_bold_words_seed.json. Some short terms (you, your, we) produce false positives — accept until canonical list lands. |
| Approved phone number list | Not yet supplied. axa_phone_inventory lists numbers only. Becomes axa_phone_compliance when list arrives. |
| WCAG target | AAA (Phase 4 veraPDF profile setting) |
| Print preflight scope | "Is it print-ready?" simple version. Expand later if needed. |
| Page sampling defaults | N=8 for visual sanity, N=5 for print sanity |
| Volume expectation | Still pending — drives later cost decisions |
Architectural choice
One web UI, isolated backend module. Doc-mode shares the existing shell (auth, client picker, Settings, Reporting, Admin, user-access, output history) and runs as a third mode alongside static + video. New code lives in backend/document_mode/ with new endpoints under /api/document/*, gated by mode: "document" on the profile JSON. Existing single-asset clients are not touched.
Rejected: a separate /ai_qc/documents/ page. Would re-implement the shell for one client and fork future doc-mode work per-client.
Scope-aware dispatcher (Phase 1 refactor, 2026-05-01)
Each check declares its scope in the profile JSON. The dispatcher routes by scope:
| Scope | Behaviour | Cost shape | Phase 1 uses? |
|---|---|---|---|
| document | Run once over the full ingest result. Deterministic checks live in document_mode/checks.py. | $0, milliseconds | ✅ all 4 deterministic doc-checks |
| targeted | Run once on specific pages (scope_args.pages: first, last, first-N, last-N, or list). | $0, milliseconds | ✅ print_code, omg_versioning |
| page_sample | Run on N evenly-spaced pages via existing batch dispatcher. | N × LLM call | Phase 4/5 |
| page_pair | Run on aligned old/new page pairs. | M × LLM call | Phase 3 |
| page_each | Run on every page (legacy / very expensive). | N × LLM call per check | Avoid |
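The page selection behind the targeted and page_sample scopes can be sketched as below. This is a minimal illustration, not the code in dispatcher.py: the function names `resolve_targeted_pages` and `sample_pages` are hypothetical, pages are assumed to be 1-based, and the choice to always include the first and last page in a sample is an assumption.

```python
from typing import List, Union

PageSpec = Union[str, List[int]]


def resolve_targeted_pages(spec: PageSpec, page_count: int) -> List[int]:
    """Turn a scope_args.pages value into 1-based page numbers.

    Accepted forms: "first", "last", "first-N", "last-N", or an explicit list.
    """
    if page_count <= 0:
        return []
    if isinstance(spec, list):
        # Explicit list: keep only pages that exist, preserve order, drop repeats.
        seen, pages = set(), []
        for p in spec:
            if 1 <= p <= page_count and p not in seen:
                seen.add(p)
                pages.append(p)
        return pages
    if spec == "first":
        return [1]
    if spec == "last":
        return [page_count]
    kind, _, count = spec.partition("-")
    if kind in ("first", "last") and count.isdigit() and int(count) > 0:
        n = min(int(count), page_count)
        if kind == "first":
            return list(range(1, n + 1))
        return list(range(page_count - n + 1, page_count + 1))
    raise ValueError(f"unsupported pages spec: {spec!r}")


def sample_pages(page_count: int, n: int) -> List[int]:
    """Pick up to n evenly spaced 1-based pages, including first and last."""
    if page_count <= 0 or n <= 0:
        return []
    if n >= page_count:
        return list(range(1, page_count + 1))
    if n == 1:
        return [1]
    step = (page_count - 1) / (n - 1)
    # A set removes duplicates that rounding can produce on short documents.
    return sorted({1 + round(i * step) for i in range(n)})
```

With the 86-page Home Insurance PDF and the N=8 default, `sample_pages(86, 8)` selects pages 1, 13, 25, 37, 50, 62, 74 and 86, so a page_sample check costs 8 LLM calls instead of 86.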
Profile JSON shape:
"checks": {
"axa_font_inventory": {"weight": 1.0, "scope": "document", "enabled": true},
"axa_print_code": {"weight": 1.0, "scope": "targeted", "scope_args": {"pages": "last"}, "enabled": true}
}
The scope field is optional in the QCCheckConfig dataclass — defaults to None, which the doc-mode dispatcher treats as page_each for legacy compatibility. Asset-mode pipeline ignores scope entirely.
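A minimal sketch of how an optional scope could sit on the check config follows. The real QCCheckConfig in profile_config.py is not reproduced here: the field set, `from_dict`, `to_dict` and `effective_scope` are assumptions chosen to illustrate the two behaviours described in this plan (an undeclared scope is treated as page_each in doc mode, and the new fields are persisted only when set).

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

VALID_SCOPES = {"document", "targeted", "page_sample", "page_pair", "page_each"}


@dataclass
class QCCheckConfig:
    weight: float = 1.0
    enabled: bool = True
    scope: Optional[str] = None                  # None means "not declared"
    scope_args: Optional[Dict[str, Any]] = None

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "QCCheckConfig":
        scope = raw.get("scope")
        if scope is not None and scope not in VALID_SCOPES:
            raise ValueError(f"unknown scope: {scope!r}")
        return cls(
            weight=float(raw.get("weight", 1.0)),
            enabled=bool(raw.get("enabled", True)),
            scope=scope,
            scope_args=raw.get("scope_args"),
        )

    def to_dict(self) -> Dict[str, Any]:
        # scope / scope_args are written only when set, so asset-mode
        # profiles round-trip without gaining new keys.
        out: Dict[str, Any] = {"weight": self.weight, "enabled": self.enabled}
        if self.scope is not None:
            out["scope"] = self.scope
        if self.scope_args:
            out["scope_args"] = self.scope_args
        return out


def effective_scope(cfg: QCCheckConfig) -> str:
    """Doc-mode view of a check's scope: undeclared falls back to page_each."""
    return cfg.scope or "page_each"
```

Keeping the fallback in one helper means the asset-mode pipeline never has to look at scope, while the doc-mode dispatcher gets a single place to change the legacy default.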
Phased delivery
Each phase ships as its own tag so we can demo / rollback in slices.
Phase 0 — Discovery (you, before further code)
Need from AXA before building Phase 2+:
- Approved Monotype font list (family + weights). Brand refresh moved them off old BOX-licensed fonts; we need the canonical list for axa_font_compliance.
- Bold-words dictionary for General Definitions (Example 2 says 70+).
- 2–3 more old/new PDF pairs beyond Examples 1 & 2 (ideally Motor Insurance for diversity).
- WCAG target — AA or AAA. EAA scope confirmation.
- Print preflight scope — "is it print-ready?" or full PDF/X-1a/4 compliance.
- Volume expectation — 80 pages × how many docs/month.
Phase 1 — Document-mode plumbing + deterministic checks — REFACTORED 2026-05-01 ✅
Original Phase 1 (2026-04-29): spine only, ran existing image-based accessibility per page on all 86 pages. Smoke test ran in ~70min for ~$0.50. Output report revealed every "failing page" failed for the same false-positive reason: "document is presented as an image of text / WCAG 1.4.5" — the LLM was critiquing our PNG rendering pipeline, not AXA's actual PDF. Result: noisy 67.7/Pass with no actionable findings.
Refactor (2026-05-01) on feature/axa-document-mode: scope-aware dispatcher + 6 deterministic doc-scope checks. Same Home Insurance PDF now scores 81.4/Pass in seconds at $0 cost, with real findings: font inventory, phone-number inventory, 132 non-bold defined-term occurrences flagged across 53 pages, 5 page-numbering discontinuities, print code "AG400 11/25" detected.
Files added:
- backend/document_mode/__init__.py
- backend/document_mode/ingest.py — multi-page PDF → per-page PNGs + per-span structured text (font, size, bold flag, italic flag, bbox). Uses PyMuPDF. Bold detection = flags & 16 OR font name contains bold|black|heavy. Default zoom 2.0×, max dim 1600 px, page cap 200.
- backend/document_mode/dispatcher.py — scope-aware routing. document and targeted checks bypass LLM; page_sample / page_each use existing batch dispatcher; page_pair reserved for Phase 3.
- backend/document_mode/checks.py — registry of 6 deterministic doc-scope checks (see table below). Each returns {check_name, scope, score, pass, summary, findings, response}.
- backend/document_mode/data/axa_bold_words_seed.json — bootstrap dictionary, 35 terms extracted from Example 1 General Definitions (pages 8-10).
- backend/document_mode/result_writer.py — writes JSON + self-contained HTML with: at-a-glance findings table, per-check sections with structured renderers (font/phone/bold-words/page-numbering/print-code/OMG each get their own table), per-page summary strip. Reports collapsed by default.
- backend/profiles/axa_policy_document.json — production profile with mode: "document", 6 deterministic checks, visibility: client_specific, visible_to_clients: ["axa"].
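The bold-detection rule in ingest.py (flag bit 16 OR a bold-sounding font name) can be sketched on a span dictionary of the kind PyMuPDF returns from its dict-style text extraction, which carries "flags" and "font" keys. The helper name `is_bold_span` and the case-insensitive match are assumptions; the font names in the usage note are made up.

```python
import re

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" value marks bold
BOLD_NAME_RE = re.compile(r"bold|black|heavy", re.IGNORECASE)


def is_bold_span(span: dict) -> bool:
    """True when either the flag bit or the font name says the span is bold."""
    flags = int(span.get("flags", 0))
    font = span.get("font", "") or ""
    return bool(flags & BOLD_FLAG) or bool(BOLD_NAME_RE.search(font))
```

For example, `is_bold_span({"flags": 4, "font": "ExampleSans-Heavy"})` is true through the name rule even though the flag bit is clear. The name rule is a substring match, so a regular-weight face whose family name happens to contain "black" would also count as bold; that is worth keeping in mind when reading bold-words findings.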
The 6 Phase-1 deterministic checks
| Check | Scope | What it does | Becomes (when client supplies data) |
|---|---|---|---|
| axa_font_inventory | document | Lists every unique font + per-page distribution | axa_font_compliance (flags non-approved) |
| axa_phone_inventory | document | Regex-extracts every phone-shaped number, dedup, with page refs | axa_phone_compliance (flags non-approved) |
| axa_bold_words_definitions | document | Scans for seed-dictionary terms, flags non-bold occurrences | Same — replace seed dict with AXA's canonical list |
| axa_page_numbering | document | Detects standalone-line integers near top/bottom, flags discontinuities | Same |
| axa_print_code | targeted: last | Finds back-page print/version line components (code + ref + date + version) | Same — refine regex once AXA confirms format |
| axa_omg_versioning | targeted: last | Finds OMG code + date format on back page | Same |
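As an illustration of the bold-words scan, the sketch below walks spans that already carry a page number and a bold flag and records every non-bold occurrence of a dictionary term. It assumes a simplified span shape (`page`, `text`, `bold`) and a hypothetical function name; the actual check in checks.py may tokenise and report differently.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def find_non_bold_terms(spans: Iterable[dict], terms: List[str]) -> Dict[str, List[int]]:
    """Map each defined term to the pages where it appears in a non-bold span."""
    if not terms:
        return {}
    # Longest first so multi-word terms win over shorter terms they contain.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b", re.IGNORECASE
    )
    hits: Dict[str, List[int]] = defaultdict(list)
    for span in spans:
        if span.get("bold"):
            continue  # bold occurrences are compliant
        for match in pattern.finditer(span.get("text", "")):
            hits[match.group(1).lower()].append(span["page"])
    return dict(hits)
```

Whole-word matching keeps "we" from firing inside "were", but it cannot tell a defined-term "you" from an ordinary pronoun, which is the false-positive source noted for the seed dictionary.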
Files modified:
- backend/profile_config.py — Profile.mode field defaults to "asset". QCCheckConfig gains scope and scope_args fields, both optional. Persisted only when non-default.
- backend/api_server.py — POST /api/document/start_analysis endpoint. The enabled_checks filter accepts checks from the document-mode registry (is_document_scope_check) in addition to the legacy qc_apps registry, so deterministic AXA checks aren't filtered out.
- backend/client_config.py — AXA client gains axa_policy_document as first profile.
- web_ui.html — doc-mode banner under upload area, file-input accept swapped to PDF-only, performAnalysisWithProgress routes to /api/document/start_analysis with client_id.
Smoke-tested 2026-05-01 (post-refactor) against same Home Insurance PDF:
- Score: 81.4 / 100 (Pass)
- Total runtime: a few seconds (deterministic only, no LLM calls)
- Total cost: $0
- Findings: 10 fonts, 8 phone numbers, 132 non-bold defined-term occurrences across 53 pages, 5 page-number discontinuities, print code "AG400" + "11/25" detected on back page, no OMG present.
- Smoke-test report: backend/output-dev/axa/PHASE1_REFACTOR_smoke_test_report.html
Local test plan (after ./scripts/run-local.sh):
- Pick AXA client → AXA Policy Document profile
- Upload axa_ireland/Example 1/6317047 - AXA - Home Insurance Policy 2025 V8 final new brand.pdf
- Verify: doc-mode banner, PDF-only picker, progress completes in seconds, report appears with at-a-glance table + per-check sections + structured findings.
Known gotchas to surface during demo:
- Bold-words bootstrap dictionary contains short terms (you, your, we, us) which produce false positives in normal pronoun usage. Mitigated by Phase-2 work (canonical list from AXA).
- Page-numbering heuristic catches TOC-page numbers as false-positive "page numbers" (5 such hits in this doc). Surface as data, score gently.
- Print-code regex tuned to "AG400 11/25" pattern observed in Example 1; may need tuning for other docs.
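To make the tuning risk concrete, here is a sketch of a pattern that fits the one observed value, "AG400 11/25". The exact shape (two capital letters, three or more digits, then a two-part date read as MM/YY) is an assumption generalised from a single example, and `find_print_code` is a hypothetical helper, not the regex used in the check.

```python
import re
from typing import Optional

# Assumed shape: 2 capital letters + 3 or more digits, whitespace, then MM/YY.
PRINT_CODE_RE = re.compile(r"\b([A-Z]{2}\d{3,})\s+((?:0[1-9]|1[0-2])/\d{2})\b")


def find_print_code(page_text: str) -> Optional[dict]:
    """Return the first code/date pair found on the page, or None."""
    match = PRINT_CODE_RE.search(page_text)
    if match is None:
        return None
    return {"code": match.group(1), "date": match.group(2)}
```

A pattern this narrow misses any document whose code has a different letter or digit count, so it is a candidate for loosening once AXA confirms the format.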
Phase 2 — Deterministic checks (~3–5 days)
The cheap, accurate wins. No LLM cost.
- backend/document_mode/font_compliance.py — reads PDF font inventory, flags anything not on AXA's approved list. Per-page failure log. Plugs into process_single_check via the same early-branch pattern as dj_file_naming (line ~384 in api_server.py).
- backend/document_mode/bold_words.py — scans pages for AXA's bold-words dictionary, flags any occurrence not rendered bold.
- backend/document_mode/print_code.py — extracts back-page print code, optionally compares to brief-supplied value.
- backend/document_mode/omg_versioning.py — confirms back-page OMG number + date format compliance (regex pattern, similar to file_naming_validator).
- New Settings → "AXA Configuration" tab for uploading approved fonts list + bold-words list per client (same pattern as media plan upload).
Phase 3 — Old-vs-new diff — DONE on feature/axa-document-mode 2026-05-01 ✅
Vision-LLM-based page-pair diff. Validates the original Example-2 promise: catches the bold-formatting fixes, structural changes, definition updates, and content additions/removals that V1 missed and V10 fixed.
Files added:
- backend/document_mode/diff_engine.py — page alignment via difflib SequenceMatcher (windowed fuzzy match, threshold 0.4) + parallel page-pair vision-LLM diff via Gemini 2.5 Pro (8 concurrent). Returns alignment map + structured diff JSON per pair (added/removed/modified/moved/style_changes/severity).
- backend/document_mode/diff_report_writer.py — diff-specific HTML/JSON. Versions card, at-a-glance grid (page count delta, severity counts), full alignment table, per-pair cards with severity pills + categorised diff blocks.
- backend/profiles/axa_policy_document_diff.json — mode: document_diff profile.
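The windowed fuzzy match can be pictured with the sketch below. It is an assumed, simplified version and not the code in diff_engine.py: `align_pages`, the window of 5 pages and the greedy in-order strategy are illustrative choices, while the 0.4 threshold comes from this plan. `autojunk=False` is set because SequenceMatcher's default junk heuristic treats frequent characters as junk once the second sequence reaches 200 elements, which would distort ratios on full page texts.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def align_pages(old: List[str], new: List[str], window: int = 5,
                threshold: float = 0.4) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Greedy in-order alignment of page texts.

    Returns (pairs, removed_old, added_new) as 0-based page indices.
    """
    pairs: List[Tuple[int, int]] = []
    next_new = 0  # alignment is monotonic: never match behind the last match
    for i, old_text in enumerate(old):
        best_j, best_ratio = -1, 0.0
        # Only look a few pages ahead of the last matched new page.
        for j in range(next_new, min(len(new), next_new + window + 1)):
            ratio = SequenceMatcher(None, old_text, new[j], autojunk=False).ratio()
            if ratio >= threshold and ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_j >= 0:
            pairs.append((i, best_j))
            next_new = best_j + 1
    matched_old = {i for i, _ in pairs}
    matched_new = {j for _, j in pairs}
    removed = [i for i in range(len(old)) if i not in matched_old]
    added = [j for j in range(len(new)) if j not in matched_new]
    return pairs, removed, added
```

Old pages with no match above the threshold are reported as removed and unmatched new pages as added. Only the matched pairs would go on to the vision-LLM diff, so the LLM cost scales with the number of pairs.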
Files modified:
- backend/api_server.py — new POST /api/document/start_diff endpoint accepting old_file + new_file. Reuses _require_client_access, progress_tracker, ensure_client_output_folder, usage_tracker.
- backend/client_config.py — AXA profile list gains axa_policy_document_diff.
- web_ui.html — third mode: document_diff UX path. Two-slot drop-zone (old + new). applyProfileMode() swaps between asset/document/document_diff. wireDiffPickers() wires the dual file pickers. startAnalysis() + performAnalysisWithProgress() route diff-mode submissions to /api/document/start_diff.
Smoke-test 2026-05-01 — V1 (68 pages, broken) vs V10 (74 pages, corrected):
- Wall: 214 seconds (3:34)
- Tokens: 214,342 (cost ≈ $0.50–$0.70)
- 63 matched pairs · 11 pages added in V10 · 5 pages removed
- 61 pages with differences flagged · 2 unchanged
- Severity: 25 high, 32 medium, 4 low
- Score 0/100, "Major changes" — correct call
- Smoke-test report: backend/output-dev/axa/phase3_smoke_*_diff_report.html
Caught the Example-2-class defects:
- Bold-formatting changes (e.g. "the terms 'us', 'we', and 'adviser' are now bolded", "the term 'your' is now bolded") — exactly the missed-bold issue that motivated this build.
- New Section F: Legal Protection added in V10 — structural insertion caught.
- New "Period of Insurance" definition added — defined-term addition caught.
- "Employee" definition expanded by a sub-point — definition modification caught.
- Wording fix: "supply your own expense" → "supply at your own expense" — body-text correction caught.
- 11 pages flagged as added, 5 as removed — page-count delta and structural restructure caught.
Cost dial (for show-and-tell): ~$0.40–0.70 per diff against a typical 70-80-page policy.
Local test plan (UI):
- ./scripts/run-local.sh
- AXA → AXA Policy Document — Old vs New Diff
- Pick V1.pdf as old, V10.pdf as new (both from axa_ireland/Example 2/)
- Click analyse. Wait ~3-5 minutes. Report lands in saved files.
Phase 4 — PDF accessibility — DONE on feature/axa-document-mode 2026-05-01 ✅
Pure-Python implementation. Original plan was veraPDF subprocess (Java dependency, ~150MB install). Built deterministic PyMuPDF-based check instead — no Java needed for the demo, with veraPDF as an optional add-on later.
Files added:
- backend/document_mode/accessibility_checks.py — 9 PDF/UA-aligned criteria checked deterministically:
- C1 Tagged PDF (StructTreeRoot present)
- C2 Marked content (/MarkInfo /Marked true)
- C3 Document title metadata
- C4 Document language (/Lang)
- C5 No password protection blocking AT
- C6 All fonts embedded
- C7 PDF version ≥ 1.5
- C8 XMP UA-conformance declaration
- C9 Alt text on images (sampling)
- Plus a _run_verapdf() stub for future veraPDF integration
Files modified:
- backend/document_mode/checks.py — axa_pdf_accessibility registered.
- backend/document_mode/ingest.py — pdf_path added to ingest_result so doc-scope checks can read raw PDF structure.
- backend/document_mode/result_writer.py — _render_pdf_accessibility structured renderer (criteria checklist with pass/fail markers).
- backend/profiles/axa_policy_document.json — axa_pdf_accessibility added at weight 2.0.
Smoke-test 2026-05-01 against Example 1 Home Insurance V8 (Adobe InDesign output):
- Overall AXA Policy Document score: 80.6 / Pass (7 checks, $0 cost, runs in seconds)
- Accessibility check: 7.78 / 10 (7 of 9 criteria passed)
- Real gaps caught:
- C7 fail: PDF 1.4 — should be 1.5+ for full accessibility tagging support
- C8 fail: No PDF/UA-1 conformance flag in XMP metadata
- Pass: tagged structure, marked content, title set, /Lang=en, no encryption, all 10 fonts embedded, alt-text entries detected
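The 7.78 / 10 figure is consistent with every criterion counting equally: 7 passes out of 9, scaled to 10. A small sketch of that arithmetic follows; `criteria_score` is a hypothetical helper, and equal weighting is inferred from the reported numbers rather than confirmed from the code.

```python
def criteria_score(results: dict, scale: float = 10.0) -> float:
    """Share of criteria passed, scaled to 0..scale and rounded to 2 dp."""
    if not results:
        return 0.0
    passed = sum(1 for ok in results.values() if ok)
    return round(scale * passed / len(results), 2)


# Example 1 Home Insurance V8: C7 and C8 fail, the other seven pass.
accessibility = {f"C{i}": True for i in range(1, 10)}
accessibility["C7"] = False
accessibility["C8"] = False
print(criteria_score(accessibility))  # 7.78
```

The same formula gives 5.71 for a check where 4 of 7 criteria pass.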
veraPDF integration plan (when ready):
- Install veraPDF on host: https://verapdf.org/software/ (requires JRE 8+, ~150MB)
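- Ensure verapdf binary on PATH or set VERAPDF_BIN env var
- Replace _run_verapdf() stub with subprocess.run([verapdf, '--format', 'json', '--flavour', 'ua1', pdf_path], capture_output=True) and merge JSON findings into axa_pdf_accessibility's output. (--flavour selects the built-in PDF/UA-1 validation profile; --profile expects a path to a profile file. Confirm both against verapdf --help on the installed version.)
- Set findings['verapdf_run'] = True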
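A hedged sketch of what the filled-in stub might look like is below. The function names are hypothetical, the shape of veraPDF's JSON report is deliberately left unparsed, and the use of `--flavour ua1` with `--format json` reflects my understanding of the veraPDF CLI, which should be checked against `verapdf --help` for the installed version.

```python
import json
import os
import shutil
import subprocess
from typing import List, Optional


def find_verapdf() -> Optional[str]:
    """Resolve the veraPDF executable: VERAPDF_BIN first, then PATH."""
    return os.environ.get("VERAPDF_BIN") or shutil.which("verapdf")


def build_verapdf_command(binary: str, pdf_path: str) -> List[str]:
    # Assumption: --flavour picks a built-in validation profile (ua1 = PDF/UA-1).
    return [binary, "--format", "json", "--flavour", "ua1", pdf_path]


def run_verapdf(pdf_path: str, timeout_s: int = 300) -> dict:
    """Run veraPDF if available; report verapdf_run=False on any failure."""
    binary = find_verapdf()
    if binary is None:
        return {"verapdf_run": False, "reason": "verapdf not found"}
    try:
        proc = subprocess.run(build_verapdf_command(binary, pdf_path),
                              capture_output=True, text=True, timeout=timeout_s)
        report = json.loads(proc.stdout)
    except (OSError, subprocess.TimeoutExpired, json.JSONDecodeError) as exc:
        return {"verapdf_run": False, "reason": str(exc)}
    # The report is returned as-is; mapping it onto C1..C9 is left open.
    return {"verapdf_run": True, "returncode": proc.returncode, "report": report}
```

Returning `verapdf_run: False` instead of raising keeps the deterministic PyMuPDF criteria usable on hosts without a JRE, which matches the "optional add-on" intent of this phase.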
Phase 5 — Print preflight — DONE on feature/axa-document-mode 2026-05-01 ✅
Pure-Python implementation. Original plan was Ghostscript-based; built deterministic PyMuPDF checks instead — same approach as Phase 4. Ghostscript can plug in later for total-ink-coverage / registration-black if scope grows.
Files added:
- backend/document_mode/print_preflight_checks.py — 7 deterministic preflight criteria:
- PP1 Page geometry consistency (single MediaBox size across all pages)
- PP2 Bleed area defined (TrimBox/BleedBox differ from MediaBox)
- PP3 Image colour spaces (flag DeviceRGB; press wants CMYK/Gray)
- PP4 Image effective DPI (raw pixels / rendered inches; flag < 150)
- PP5 Transparency / soft-mask usage (flag for flattening)
- PP6 PDF/X conformance (XMP pdfxid:GTS_PDFXVersion)
- PP7 Spot colour usage (flag /Separation, /DeviceN)
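PP4's "raw pixels / rendered inches" can be worked through as below. PDF user-space units are points at 72 per inch, so a placed width in points converts to inches by dividing by 72. The helper names and the choice to report the lower of the two axes are assumptions for illustration.

```python
POINTS_PER_INCH = 72.0


def effective_dpi(pixel_w: int, pixel_h: int,
                  placed_w_pt: float, placed_h_pt: float) -> float:
    """Lower of the horizontal and vertical resolutions of a placed image."""
    if placed_w_pt <= 0 or placed_h_pt <= 0:
        raise ValueError("placed size must be positive")
    dpi_x = pixel_w / (placed_w_pt / POINTS_PER_INCH)
    dpi_y = pixel_h / (placed_h_pt / POINTS_PER_INCH)
    return min(dpi_x, dpi_y)


def is_low_resolution(dpi: float, threshold: float = 150.0) -> bool:
    """Flag images rendered below the preflight threshold (150 DPI here)."""
    return dpi < threshold
```

A 1200 × 600 px image placed at 288 × 144 pt (4 × 2 inches) renders at 300 DPI and passes; the same image stretched to twice that placed size drops to 150 DPI, which sits exactly on the threshold and is not flagged.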
Files modified:
- backend/document_mode/checks.py — axa_print_preflight registered.
- backend/document_mode/result_writer.py — _render_print_preflight structured renderer with low-DPI image list, colour-space breakdown, spot-colour list, page-size detail.
- backend/profiles/axa_policy_document.json — axa_print_preflight added at weight 1.0.
Smoke-test 2026-05-01 against Example 1 Home Insurance V8:
- Print preflight: 5.71 / 10 (4 of 7 criteria pass) — correctly flags as digital-intent
- ✓ PP1 — All 86 pages 210×297 mm (A4), consistent
- ✗ PP2 — No bleed authored (digital intent — correct finding for an electronic policy)
- ✓ PP3 — Only 1 grayscale image, no RGB
- ✓ PP4 — Image renders at 279 DPI (above 150 threshold)
- ✗ PP5 — 85 of 86 pages use transparency / soft-masks (Adobe InDesign default; would need flattening for press)
- ✗ PP6 — No PDF/X conformance flag in XMP
- ✓ PP7 — No spot colour spaces
- Updated full-profile score: 78.25 / Pass (8 checks now)
Demo conversation: "If you're distributing electronically, this PDF is fine. If you're going to press, you need to (1) author bleed in InDesign, (2) flatten transparency on export, (3) declare PDF/X-1a or PDF/X-4 conformance."
All phases status (2026-05-01)
| Phase | Scope | Status | Cost | Wall |
|---|---|---|---|---|
| 1 | Spine + 6 deterministic doc-scope checks | ✅ Done | $0 | seconds |
| 2 | Compliance variants (font/phone/bold lists) | Blocked on AXA | — | — |
| 3 | Old-vs-new diff (vision LLM page-pair) | ✅ Done | ~$0.50/run | ~3-5 min |
| 4 | PDF accessibility (PyMuPDF, veraPDF stub) | ✅ Done | $0 | seconds |
| 5 | Print preflight (PyMuPDF, Ghostscript later) | ✅ Done | $0 | seconds |
Demo-ready as of 2026-05-01. All work is on feature/axa-document-mode, local-only, no commits or pushes yet.
Things to flag before any further build
- 80 pages × multiple LLM checks = serious cost. A doc with the existing static checks running per-page would be ~$5–10 in Gemini/OpenAI calls. We should decide which LLM checks need per-page vs once over a sampled set. Most should be deterministic-only.
- veraPDF is Java. Adds a JRE dependency to GCP boxes.
- PDF mode breaks "one upload = one report" assumption. Decide what to save: full per-page JSON, summary only, or both. (Phase 1 saves both.)
- Reporting/billing. An 80-page doc is one analysis but 80× the LLM work. We should bill it as one analysis but track total checks separately. usage_tracker.log_analysis_complete already gets pages_processed in doc mode.
How doc-mode plugs into existing pipeline
For maintenance — the integration map:
| Doc-mode component | Reuses existing | Where |
|---|---|---|
| Per-page check execution | process_checks_in_batches() | api_server.py:498 |
| Per-check dispatch | process_single_check() | api_server.py:377 |
| LLM call | run_visual_qc() | llm_config.py |
| Auth + client access | auth.require_auth, _require_client_access() | api_server.py:4883 |
| Progress polling | /api/progress/<id> | api_server.py:1695 |
| Output serving | /output/<client>/<filename> | api_server.py:2121 |
| Output listing | /api/output_files | api_server.py:2168 |
| Output folder | ensure_client_output_folder() | api_server.py:856 |
| Profile loading | profile_config.get_profile() | profile_config.py:219 |
| Profile visibility | client_config.get_profiles_with_visibility() | client_config.py:82 |
| Usage logging | usage_tracker.log_analysis_start/complete() | usage_tracker.py:73,100 |
Future deterministic doc-mode checks should follow _run_dj_file_naming_check() (api_server.py:348) — short-circuit at the top of process_single_check before any LLM dispatch.
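The short-circuit pattern can be sketched as follows. This is a simplified stand-in, not the real code in api_server.py: the registry, the `register` decorator, the `process_single_check` signature and the `example_page_count` check are all hypothetical. Only the result keys (check_name, scope, score, pass, summary, findings, response) and the name is_document_scope_check are taken from this plan.

```python
from typing import Callable, Dict, Optional

CheckFn = Callable[[dict], dict]

# Hypothetical registry; the real one lives in document_mode/checks.py.
DOCUMENT_SCOPE_CHECKS: Dict[str, CheckFn] = {}


def register(name: str) -> Callable[[CheckFn], CheckFn]:
    """Decorator that adds a deterministic check to the registry."""
    def decorator(fn: CheckFn) -> CheckFn:
        DOCUMENT_SCOPE_CHECKS[name] = fn
        return fn
    return decorator


def is_document_scope_check(name: str) -> bool:
    return name in DOCUMENT_SCOPE_CHECKS


def process_single_check(name: str, ingest_result: dict,
                         llm_dispatch: Optional[Callable[[str, dict], dict]] = None) -> dict:
    # Early branch: deterministic checks return before any LLM work is set up.
    if is_document_scope_check(name):
        return DOCUMENT_SCOPE_CHECKS[name](ingest_result)
    if llm_dispatch is None:
        raise LookupError(f"no handler for check {name!r}")
    return llm_dispatch(name, ingest_result)


@register("example_page_count")
def example_page_count(ingest_result: dict) -> dict:
    pages = len(ingest_result.get("pages", []))
    return {"check_name": "example_page_count", "scope": "document",
            "score": 10.0 if pages else 0.0, "pass": pages > 0,
            "summary": f"{pages} page(s) ingested",
            "findings": {"pages": pages}, "response": ""}
```

Because the branch sits before any LLM dispatch, a new deterministic check only needs a registry entry and never touches the LLM dispatch path.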