ai_qc/backend/AXA_DOCUMENT_MODE_PLAN.md
nickviljoen 90563b8cf2 Add AXA document-mode QC pipeline (Phases 1, 3, 4, 5)
Multi-page PDF QC for AXA Ireland policy documents. Runs as a third mode
alongside static + video, gated on profile.mode. New code isolated under
backend/document_mode/ with new endpoints under /api/document/*.

Phase 1 — Spine + 6 deterministic doc-scope checks ($0, runs in seconds):
- Scope-aware dispatcher (document/targeted/page_sample/page_pair/page_each)
- axa_font_inventory, axa_phone_inventory, axa_bold_words_definitions,
  axa_page_numbering, axa_print_code, axa_omg_versioning
- Bootstrap bold-words dictionary extracted from Example 1 General Definitions

Phase 3 — Old-vs-new diff (~$0.50/run, 3-5 min):
- Page alignment via difflib SequenceMatcher (windowed fuzzy match)
- Vision-LLM page-pair diff via Gemini 2.5 Pro (8 concurrent)
- Two-slot upload UX, axa_policy_document_diff profile, mode=document_diff

Phase 4 — PDF accessibility (PyMuPDF, $0):
- 9 PDF/UA-1 aligned criteria (tagged structure, /MarkInfo, title, /Lang,
  encryption, font embedding, PDF version, XMP UA-conformance, alt-text)
- _run_verapdf() stub for optional Java-based veraPDF integration later

Phase 5 — Print preflight (PyMuPDF, $0):
- 7 criteria (page geometry, bleed, image colour spaces, image DPI,
  transparency, PDF/X conformance, spot colours)

Profile additions:
- axa_policy_document — 8 deterministic checks, $0 cost
- axa_policy_document_diff — 1 page-pair LLM check, ~$0.50/run

API additions:
- POST /api/document/start_analysis (single PDF)
- POST /api/document/start_diff (old + new PDFs)

Frontend additions:
- Third profile.mode value (document_diff) in applyProfileMode()
- Two-slot upload UX with PDF-only file pickers
- checkFormValidity() branches by mode for the analyse-button gate

Smoke-tested locally against Example 1 (Home Insurance V8, 86pp) and
Example 2 (Landlord V1 vs V10, 68→74pp) with real findings caught
including bold-words gaps, missing PDF/UA flag, transparency on press,
V1→V10 bold-formatting fixes. Plan + integration map + gotchas in
backend/AXA_DOCUMENT_MODE_PLAN.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:38:14 +02:00


AXA Document-Mode QC — Build Plan

Multi-page PDF QC pipeline for AXA Ireland. It differs from every other onboarded client because the QC target is an 80-page policy PDF, not a single image or video. Sources of requirements:

  • axa_ireland/AXA build guide new tools required.txt — original ask + first scoping
  • axa_ireland/Exampole Folders explained.rtf — what the example folders represent
  • axa_ireland/Example 1/ — old vs new (post brand-refresh) Home Insurance policy + the human QC checklist (Policy documents QC CHECKS.docx)
  • axa_ireland/Example 2/ — Landlord Insurance v1 (shipped with errors), V10 (corrected), V1 2025 amends, original master. Phase-3 diff test pair: V1 vs V10.

Sequencing (updated 2026-05-01)

Inverted from "discover everything → build" to "build minimum demo → show client → drive deeper discovery from their reactions".

  1. Phase 1 refactor DONE on feature/axa-document-mode 2026-05-01 — scope-aware dispatcher, 6 deterministic checks, $0 LLM cost, runs in seconds. Replaces broken Phase-1 stub.
  2. Phase 3 build (next) — old-vs-new diff using vision-LLM page-pair, no LlamaParse account needed.
  3. Show-and-tell with AXA — demonstrate Phases 1 + 3 working, show costs, gather requirements (font list, bold-words list, phone numbers, WCAG target, print preflight scope).
  4. Phases 2, 4, 5 in order, with local testing between each.
  5. Local-only until show-and-tell; do not push feature/axa-document-mode to dev or merge to develop yet.

Phase 0 discovery — answered 2026-05-01 (provisional, pending client confirmation)

| Question | Decision |
| --- | --- |
| Approved Monotype font list | Not yet supplied. Until then, axa_font_inventory lists fonts only (informational). When list arrives, becomes axa_font_compliance. |
| Bold-words dictionary | Bootstrap from Example 1 General Definitions (pages 8-10) until AXA supplies canonical list. 35 terms extracted, saved to backend/document_mode/data/axa_bold_words_seed.json. Some short terms (you, your, we) produce false positives; accept until canonical list lands. |
| Approved phone number list | Not yet supplied. axa_phone_inventory lists numbers only. Becomes axa_phone_compliance when list arrives. |
| WCAG target | AAA (Phase 4 veraPDF profile setting) |
| Print preflight scope | "Is it print-ready?" simple version. Expand later if needed. |
| Page sampling defaults | N=8 for visual sanity, N=5 for print sanity |
| Volume expectation | Still pending; drives later cost decisions |

Architectural choice

One web UI, isolated backend module. Doc-mode shares the existing shell (auth, client picker, Settings, Reporting, Admin, user-access, output history) and runs as a third mode alongside static + video. New code lives in backend/document_mode/ with new endpoints under /api/document/*, gated by mode: "document" on the profile JSON. Existing single-asset clients are not touched.

Rejected: a separate /ai_qc/documents/ page. Would re-implement the shell for one client and fork future doc-mode work per-client.

Scope-aware dispatcher (Phase 1 refactor, 2026-05-01)

Each check declares its scope in the profile JSON. The dispatcher routes by scope:

| Scope | Behaviour | Cost shape | Phase 1 uses? |
| --- | --- | --- | --- |
| document | Run once over the full ingest result. Deterministic checks live in document_mode/checks.py. | $0, milliseconds | all 4 deterministic doc-checks |
| targeted | Run once on specific pages (scope_args.pages: first, last, first-N, last-N, or list). | $0, milliseconds | print_code, omg_versioning |
| page_sample | Run on N evenly-spaced pages via existing batch dispatcher. | N × LLM call | Phase 4/5 |
| page_pair | Run on aligned old/new page pairs. | M × LLM call | Phase 3 |
| page_each | Run on every page (legacy / very expensive). | N × LLM call per check | Avoid |

Profile JSON shape:

"checks": {
  "axa_font_inventory": {"weight": 1.0, "scope": "document", "enabled": true},
  "axa_print_code": {"weight": 1.0, "scope": "targeted", "scope_args": {"pages": "last"}, "enabled": true}
}

The scope field is optional in the QCCheckConfig dataclass — defaults to None, which the doc-mode dispatcher treats as page_each for legacy compatibility. Asset-mode pipeline ignores scope entirely.
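
A minimal sketch of how scope routing could resolve pages, not the actual dispatcher code. The function names, the literal spellings "first-3" / "last-2" for first-N / last-N, and the scope_args key "n" are assumptions made for illustration; only the scope names and the None-means-page_each default come from the plan above.

```python
from typing import Any, Dict, List, Union

PagesSpec = Union[str, List[int]]


def resolve_targeted_pages(pages: PagesSpec, page_count: int) -> List[int]:
    """Turn a scope_args.pages value into 1-based page numbers."""
    if isinstance(pages, list):
        return [p for p in pages if 1 <= p <= page_count]
    if pages == "first":
        return [1] if page_count else []
    if pages == "last":
        return [page_count] if page_count else []
    if pages.startswith("first-"):  # assumed spelling of "first-N"
        n = int(pages.split("-", 1)[1])
        return list(range(1, min(n, page_count) + 1))
    if pages.startswith("last-"):  # assumed spelling of "last-N"
        n = int(pages.split("-", 1)[1])
        return list(range(max(1, page_count - n + 1), page_count + 1))
    raise ValueError(f"unsupported pages spec: {pages!r}")


def sample_pages(page_count: int, n: int) -> List[int]:
    """Pick n roughly evenly spaced 1-based pages, always including first and last."""
    if page_count <= 0 or n <= 0:
        return []
    if n >= page_count:
        return list(range(1, page_count + 1))
    if n == 1:
        return [1]
    step = (page_count - 1) / (n - 1)
    return sorted({1 + round(i * step) for i in range(n)})


def plan_check(check_cfg: Dict[str, Any], page_count: int, sample_n: int = 8) -> Dict[str, Any]:
    """Decide which pages a check sees and whether it needs an LLM call."""
    scope = check_cfg.get("scope") or "page_each"  # None falls back to legacy behaviour
    args = check_cfg.get("scope_args") or {}
    if scope == "document":
        return {"scope": scope, "pages": None, "uses_llm": False}
    if scope == "targeted":
        pages = resolve_targeted_pages(args.get("pages", "last"), page_count)
        return {"scope": scope, "pages": pages, "uses_llm": False}
    if scope == "page_sample":
        pages = sample_pages(page_count, args.get("n", sample_n))  # "n" is hypothetical
        return {"scope": scope, "pages": pages, "uses_llm": True}
    if scope == "page_pair":
        return {"scope": scope, "pages": None, "uses_llm": True}
    if scope == "page_each":
        return {"scope": scope, "pages": list(range(1, page_count + 1)), "uses_llm": True}
    raise ValueError(f"unknown scope: {scope!r}")
```

Keeping page resolution separate from execution makes the cost shape of a profile easy to inspect before any LLM call is made.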

Phased delivery

Each phase ships as its own tag so we can demo / rollback in slices.

Phase 0 — Discovery (you, before further code)

Need from AXA before building Phase 2+:

  • Approved Monotype font list (family + weights). Brand refresh moved them off old BOX-licensed fonts; we need the canonical list for axa_font_compliance.
  • Bold-words dictionary for General Definitions (Example 2 says 70+).
  • 2-3 more old/new PDF pairs beyond Examples 1 & 2 (ideally Motor Insurance for diversity).
  • WCAG target — AA or AAA. EAA scope confirmation.
  • Print preflight scope — "is it print-ready?" or full PDF/X-1a/4 compliance.
  • Volume expectation — 80 pages × how many docs/month.

Phase 1 — Document-mode plumbing + deterministic checks — REFACTORED 2026-05-01

Original Phase 1 (2026-04-29): spine only, ran existing image-based accessibility per page on all 86 pages. Smoke test ran in ~70min for ~$0.50. Output report revealed every "failing page" failed for the same false-positive reason: "document is presented as an image of text / WCAG 1.4.5" — the LLM was critiquing our PNG rendering pipeline, not AXA's actual PDF. Result: noisy 67.7/Pass with no actionable findings.

Refactor (2026-05-01) on feature/axa-document-mode: scope-aware dispatcher + 6 deterministic doc-scope checks. Same Home Insurance PDF now scores 81.4/Pass in seconds at $0 cost, with real findings: font inventory, phone-number inventory, 132 non-bold defined-term occurrences flagged across 53 pages, 5 page-numbering discontinuities, print code "AG400 11/25" detected.

Files added:

  • backend/document_mode/__init__.py
  • backend/document_mode/ingest.py — multi-page PDF → per-page PNGs + per-span structured text (font, size, bold flag, italic flag, bbox). Uses PyMuPDF. Bold detection = flags & 16 OR font name contains bold|black|heavy. Default zoom 2.0×, max dim 1600 px, page cap 200.
  • backend/document_mode/dispatcher.py — scope-aware routing. document and targeted checks bypass LLM; page_sample / page_each use existing batch dispatcher; page_pair reserved for Phase 3.
  • backend/document_mode/checks.py — registry of 6 deterministic doc-scope checks (see table below). Each returns {check_name, scope, score, pass, summary, findings, response}.
  • backend/document_mode/data/axa_bold_words_seed.json — bootstrap dictionary, 35 terms extracted from Example 1 General Definitions (pages 8-10).
  • backend/document_mode/result_writer.py — writes JSON + self-contained HTML with: at-a-glance findings table, per-check sections with structured renderers (font/phone/bold-words/page-numbering/print-code/OMG each get their own table), per-page summary strip. Reports collapsed by default.
  • backend/profiles/axa_policy_document.json — production profile with mode: "document", 6 deterministic checks, visibility: client_specific, visible_to_clients: ["axa"].
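
The bold-detection rule described for ingest.py (flags & 16 OR a bold-like font name) and the non-bold term scan can be sketched as below. This is an illustrative sketch, not the code in the repository: the span dictionaries mirror the "text" / "font" / "flags" keys that PyMuPDF text spans carry, while the "page" key, the function names, the example font names, and the case-insensitive whole-word matching are assumptions.

```python
import re
from typing import Any, Dict, Iterable, List

BOLD_FLAG = 16  # bold bit in PyMuPDF span flags
BOLD_NAME_RE = re.compile(r"bold|black|heavy", re.IGNORECASE)


def is_bold_span(flags: int, font_name: str) -> bool:
    """Bold if the flag bit is set or the font name looks like a bold weight."""
    return bool(flags & BOLD_FLAG) or bool(BOLD_NAME_RE.search(font_name))


def find_non_bold_terms(spans: Iterable[Dict[str, Any]], terms: List[str]) -> List[Dict[str, Any]]:
    """Return one finding per dictionary term that appears in a non-bold span."""
    if not terms:
        return []
    # Longest terms first so multi-word terms win over their prefixes.
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    findings = []
    for span in spans:
        if is_bold_span(span["flags"], span["font"]):
            continue
        for match in pattern.finditer(span["text"]):
            findings.append({
                "page": span["page"],  # hypothetical key added at ingest time
                "term": match.group(1).lower(),
                "text": span["text"],
            })
    return findings


example_spans = [
    {"page": 9, "text": "The buildings must be insured", "font": "Example-Regular", "flags": 0},
    {"page": 9, "text": "buildings", "font": "Example-Bold", "flags": 0},
    {"page": 10, "text": "contents", "font": "Example-Regular", "flags": 16},
]
example_findings = find_non_bold_terms(example_spans, ["buildings", "contents"])
```

Whole-word matching on short pronoun-like terms is what produces the false positives noted for the seed dictionary, so a production version would likely need term-specific rules.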

The 6 Phase-1 deterministic checks

| Check | Scope | What it does | Becomes (when client supplies data) |
| --- | --- | --- | --- |
| axa_font_inventory | document | Lists every unique font + per-page distribution | axa_font_compliance (flags non-approved) |
| axa_phone_inventory | document | Regex-extracts every phone-shaped number, dedup, with page refs | axa_phone_compliance (flags non-approved) |
| axa_bold_words_definitions | document | Scans for seed-dictionary terms, flags non-bold occurrences | Same; replace seed dict with AXA's canonical list |
| axa_page_numbering | document | Detects standalone-line integers near top/bottom, flags discontinuities | Same |
| axa_print_code | targeted: last | Finds back-page print/version line components (code + ref + date + version) | Same; refine regex once AXA confirms format |
| axa_omg_versioning | targeted: last | Finds OMG code + date format on back page | Same |
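
As an illustration of the page-numbering heuristic in the table, the sketch below finds a standalone integer line near the top or bottom of a page and then flags breaks in the sequence. The function names, the two-line edge window, and the three-digit limit are assumptions, not values taken from checks.py.

```python
import re
from typing import Dict, List, Optional


def detect_printed_number(lines: List[str], edge_lines: int = 2) -> Optional[int]:
    """Return a standalone integer found in the last or first few lines of a page."""
    # Bottom lines are checked first (nearest the page edge), then top lines.
    candidates = lines[-edge_lines:][::-1] + lines[:edge_lines]
    for line in candidates:
        stripped = line.strip()
        if re.fullmatch(r"\d{1,3}", stripped):
            return int(stripped)
    return None


def find_discontinuities(printed: List[Optional[int]]) -> List[Dict[str, int]]:
    """Flag PDF pages whose printed number breaks the running sequence.

    printed[i] is the number detected on PDF page i + 1, or None if none was found.
    """
    findings = []
    prev_pos = None
    prev_num = None
    for pos, num in enumerate(printed, start=1):
        if num is None:
            continue
        if prev_num is not None:
            expected = prev_num + (pos - prev_pos)
            if num != expected:
                findings.append({"pdf_page": pos, "expected": expected, "found": num})
        prev_pos, prev_num = pos, num
    return findings


# A stray integer (for example a contents-page entry) causes two breaks:
# one where it appears and one where the real sequence resumes.
example = find_discontinuities([None, None, 3, 4, 5, 12, 7, 8])
```

Because one misread number yields two discontinuities, reporting these as data and scoring them gently is a reasonable default.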

Files modified:

  • backend/profile_config.py — Profile.mode field defaults to "asset". QCCheckConfig gains scope and scope_args fields, both optional. Persisted only when non-default.
  • backend/api_server.py — POST /api/document/start_analysis endpoint. The enabled_checks filter accepts checks from the document-mode registry (is_document_scope_check) in addition to the legacy qc_apps registry, so deterministic AXA checks aren't filtered out.
  • backend/client_config.py — AXA client gains axa_policy_document as first profile.
  • web_ui.html — doc-mode banner under upload area, file-input accept swapped to PDF-only, performAnalysisWithProgress routes to /api/document/start_analysis with client_id.

Smoke-tested 2026-05-01 (post-refactor) against same Home Insurance PDF:

  • Score: 81.4 / 100 (Pass)
  • Total runtime: a few seconds (deterministic only, no LLM calls)
  • Total cost: $0
  • Findings: 10 fonts, 8 phone numbers, 132 non-bold defined-term occurrences across 53 pages, 5 page-number discontinuities, print code "AG400" + "11/25" detected on back page, no OMG present.
  • Smoke-test report: backend/output-dev/axa/PHASE1_REFACTOR_smoke_test_report.html

Local test plan (after ./scripts/run-local.sh):

  1. Pick AXA client → AXA Policy Document profile
  2. Upload axa_ireland/Example 1/6317047 - AXA - Home Insurance Policy 2025 V8 final new brand.pdf
  3. Verify: doc-mode banner, PDF-only picker, progress completes in seconds, report appears with at-a-glance table + per-check sections + structured findings.

Known gotchas to surface during demo:

  • Bold-words bootstrap dictionary contains short terms (you, your, we, us) which produce false positives in normal pronoun usage. Mitigated by Phase-2 work (canonical list from AXA).
  • Page-numbering heuristic catches TOC-page numbers as false-positive "page numbers" (5 such hits in this doc). Surface as data, score gently.
  • Print-code regex tuned to "AG400 11/25" pattern observed in Example 1; may need tuning for other docs.
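
For the print-code gotcha, a sketch of a pattern shaped around the single observed example "AG400 11/25" is shown below. The pattern (two capital letters, three digits, then MM/YY) is a guess generalised from that one sample, and it covers only the code and date, not the ref or version components; the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed shape: 2 capital letters + 3 digits, whitespace, then MM/YY with a valid month.
PRINT_CODE_RE = re.compile(r"\b([A-Z]{2}\d{3})\s+((?:0[1-9]|1[0-2])/\d{2})\b")


def extract_print_code(back_page_text: str) -> Optional[Dict[str, str]]:
    """Return the first code/date pair found on the back page, or None."""
    match = PRINT_CODE_RE.search(back_page_text)
    if match is None:
        return None
    return {"code": match.group(1), "date": match.group(2)}
```

Returning None rather than a failing score leaves it to the caller to decide whether a missing code is an error for that document type.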

Phase 2 — Deterministic checks (~3-5 days)

The cheap, accurate wins. No LLM cost.

  • backend/document_mode/font_compliance.py — reads PDF font inventory, flags anything not on AXA's approved list. Per-page failure log. Plugs into process_single_check via the same early-branch pattern as dj_file_naming (line ~384 in api_server.py).
  • backend/document_mode/bold_words.py — scans pages for AXA's bold-words dictionary, flags any occurrence not rendered bold.
  • backend/document_mode/print_code.py — extracts back-page print code, optionally compares to brief-supplied value.
  • backend/document_mode/omg_versioning.py — confirms back-page OMG number + date format compliance (regex pattern, similar to file_naming_validator).
  • New Settings → "AXA Configuration" tab for uploading approved fonts list + bold-words list per client (same pattern as media plan upload).

Phase 3 — Old-vs-new diff — DONE on feature/axa-document-mode 2026-05-01

Vision-LLM-based page-pair diff. Validates the original Example-2 promise: catches the bold-formatting fixes, structural changes, definition updates, and content additions/removals that V1 missed and V10 fixed.

Files added:

  • backend/document_mode/diff_engine.py — page alignment via difflib SequenceMatcher (windowed fuzzy match, threshold 0.4) + parallel page-pair vision-LLM diff via Gemini 2.5 Pro (8 concurrent). Returns alignment map + structured diff JSON per pair (added/removed/modified/moved/style_changes/severity).
  • backend/document_mode/diff_report_writer.py — diff-specific HTML/JSON. Versions card, at-a-glance grid (page count delta, severity counts), full alignment table, per-pair cards with severity pills + categorised diff blocks.
  • backend/profiles/axa_policy_document_diff.json — mode: document_diff profile.
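
The page-alignment step can be sketched as a greedy, forward-only search: each old page is compared with the next few unmatched new pages using difflib.SequenceMatcher, and a pair is accepted at a ratio of 0.4 or higher. Only the use of SequenceMatcher, the windowing idea, and the 0.4 threshold come from the description above; the window size, the greedy strategy, and the function name are assumptions, and the real diff_engine.py may work differently.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]  # (old index, new index), 0-based


def align_pages(old_pages: List[str], new_pages: List[str],
                window: int = 5, threshold: float = 0.4) -> List[Pair]:
    """Align page texts; (i, None) means removed, (None, j) means added."""
    pairs: List[Pair] = []
    next_new = 0  # first new page not yet consumed
    for old_idx, old_text in enumerate(old_pages):
        best_ratio, best_new = 0.0, None
        stop = min(len(new_pages), next_new + window)
        for new_idx in range(next_new, stop):
            # autojunk=False avoids the popularity heuristic on long page texts.
            ratio = SequenceMatcher(None, old_text, new_pages[new_idx], autojunk=False).ratio()
            if ratio > best_ratio:
                best_ratio, best_new = ratio, new_idx
        if best_new is not None and best_ratio >= threshold:
            # New pages skipped on the way to the match count as added.
            for added in range(next_new, best_new):
                pairs.append((None, added))
            pairs.append((old_idx, best_new))
            next_new = best_new + 1
        else:
            pairs.append((old_idx, None))
    for added in range(next_new, len(new_pages)):
        pairs.append((None, added))
    return pairs


example_alignment = align_pages(
    ["aaaaaaaa", "bbbbbbbb", "cccccccc"],
    ["aaaaaaaa", "dddddddd", "cccccccc"],
)
```

Only the matched pairs would go on to the vision-LLM diff, so the threshold directly controls both cost and how many pages are reported as added or removed.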

Files modified:

  • backend/api_server.py — new POST /api/document/start_diff endpoint accepting old_file + new_file. Reuses _require_client_access, progress_tracker, ensure_client_output_folder, usage_tracker.
  • backend/client_config.py — AXA profile list gains axa_policy_document_diff.
  • web_ui.html — third mode: document_diff UX path. Two-slot drop-zone (old + new). applyProfileMode() swaps between asset/document/document_diff. wireDiffPickers() wires the dual file pickers. startAnalysis() + performAnalysisWithProgress() route diff-mode submissions to /api/document/start_diff.

Smoke-test 2026-05-01 — V1 (68 pages, broken) vs V10 (74 pages, corrected):

  • Wall: 214 seconds (3:34)
  • Tokens: 214,342 (cost ≈ $0.50-$0.70)
  • 63 matched pairs · 11 pages added in V10 · 5 pages removed
  • 61 pages with differences flagged · 2 unchanged
  • Severity: 25 high, 32 medium, 4 low
  • Score 0/100, "Major changes" — correct call
  • Smoke-test report: backend/output-dev/axa/phase3_smoke_*_diff_report.html

Caught the Example-2-class defects:

  • Bold-formatting changes (e.g. "the terms 'us', 'we', and 'adviser' are now bolded", "the term 'your' is now bolded") — exactly the missed-bold issue that motivated this build.
  • New Section F: Legal Protection added in V10 — structural insertion caught.
  • New "Period of Insurance" definition added — defined-term addition caught.
  • "Employee" definition expanded by a sub-point — definition modification caught.
  • Wording fix: "supply your own expense" → "supply at your own expense" — body-text correction caught.
  • 11 pages flagged as added, 5 as removed — page-count delta and structural restructure caught.

Cost dial (for show-and-tell): ~$0.40-0.70 per diff against a typical 70-80-page policy.

Local test plan (UI):

  1. ./scripts/run-local.sh
  2. AXA → AXA Policy Document — Old vs New Diff
  3. Pick V1.pdf as old, V10.pdf as new (both from axa_ireland/Example 2/)
  4. Click analyse. Wait ~3-5 minutes. Report lands in saved files.

Phase 4 — PDF accessibility — DONE on feature/axa-document-mode 2026-05-01

Pure-Python implementation. Original plan was veraPDF subprocess (Java dependency, ~150MB install). Built deterministic PyMuPDF-based check instead — no Java needed for the demo, with veraPDF as an optional add-on later.

Files added:

  • backend/document_mode/accessibility_checks.py — 9 PDF/UA-aligned criteria checked deterministically:
    • C1 Tagged PDF (StructTreeRoot present)
    • C2 Marked content (/MarkInfo /Marked true)
    • C3 Document title metadata
    • C4 Document language (/Lang)
    • C5 No password protection blocking AT
    • C6 All fonts embedded
    • C7 PDF version ≥ 1.5
    • C8 XMP UA-conformance declaration
    • C9 Alt text on images (sampling)
    • Plus a _run_verapdf() stub for future veraPDF integration
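
A rough sketch of how criteria C1, C2, C4 and C7 might be checked follows. It is not the contents of accessibility_checks.py. The checks run on the catalog object's source text so they can be exercised without a PDF; the loader uses PyMuPDF's pdf_catalog() and xref_object() and is imported lazily. The function names are hypothetical, and the string matching assumes /MarkInfo and /Lang are stored inline in the catalog rather than as indirect references.

```python
import re
from typing import Dict, Tuple


def load_catalog_source(pdf_path: str) -> str:
    """Read the PDF catalog object as text via PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the pure checks need no dependency

    with fitz.open(pdf_path) as doc:
        return doc.xref_object(doc.pdf_catalog(), compressed=True)


def check_catalog(catalog_src: str) -> Dict[str, bool]:
    """Evaluate tagging-related criteria against the catalog source text."""
    return {
        "C1_tagged": "/StructTreeRoot" in catalog_src,
        # Assumes an inline /MarkInfo dictionary; an indirect reference would be missed.
        "C2_marked": re.search(r"/Marked\s+true", catalog_src) is not None,
        # /Lang may be a literal string "(en)" or a hex string "<...>".
        "C4_lang": re.search(r"/Lang\s*[(<]", catalog_src) is not None,
    }


def pdf_version_at_least(header: bytes, minimum: Tuple[int, int] = (1, 5)) -> bool:
    """C7: compare the %PDF-x.y file header with a minimum version.

    A /Version entry in the catalog can override the header; that case is ignored here.
    """
    match = re.match(rb"%PDF-(\d+)\.(\d+)", header)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

A usage example would be check_catalog(load_catalog_source("policy.pdf")), where the path is a placeholder.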

Files modified:

  • backend/document_mode/checks.py — axa_pdf_accessibility registered.
  • backend/document_mode/ingest.py — pdf_path added to ingest_result so doc-scope checks can read raw PDF structure.
  • backend/document_mode/result_writer.py — _render_pdf_accessibility structured renderer (criteria checklist with pass/fail markers).
  • backend/profiles/axa_policy_document.json — axa_pdf_accessibility added at weight 2.0.

Smoke-test 2026-05-01 against Example 1 Home Insurance V8 (Adobe InDesign output):

  • Overall AXA Policy Document score: 80.6 / Pass (7 checks, $0 cost, runs in seconds)
  • Accessibility check: 7.78 / 10 (7 of 9 criteria passed)
  • Real gaps caught:
    • C7 fail: PDF 1.4 — should be 1.5+ for full accessibility tagging support
    • C8 fail: No PDF/UA-1 conformance flag in XMP metadata
  • Pass: tagged structure, marked content, title set, /Lang=en, no encryption, all 10 fonts embedded, alt-text entries detected

veraPDF integration plan (when ready):

  1. Install veraPDF on host: https://verapdf.org/software/ (requires JRE 8+, ~150MB)
  2. Ensure verapdf binary on PATH or set VERAPDF_BIN env var
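  3. Replace _run_verapdf() stub with subprocess.run([verapdf, '--format', 'json', '--flavour', 'ua1', pdf_path], capture_output=True) and merge JSON findings into axa_pdf_accessibility's output. (--flavour selects a built-in validation profile such as ua1, whereas --profile expects a path to a profile file; confirm both flags against verapdf --help for the installed version.)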
  4. Set findings['verapdf_run'] = True
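
The integration steps above could look roughly like the wrapper below. It is a sketch under assumptions: the helper names are hypothetical, and the command-line flags ("--format json", and "--flavour ua1" to select the built-in PDF/UA-1 profile) should be verified against verapdf --help for the installed version before relying on them. The wrapper degrades to verapdf_run = False when the binary is missing so the PyMuPDF-only path keeps working.

```python
import json
import os
import shutil
import subprocess
from typing import Any, Dict, List, Optional


def find_verapdf() -> Optional[str]:
    """Resolve the veraPDF binary from VERAPDF_BIN or PATH."""
    return shutil.which(os.environ.get("VERAPDF_BIN", "verapdf"))


def build_command(binary: str, pdf_path: str) -> List[str]:
    """Assemble the CLI call; flags are assumptions to confirm locally."""
    return [binary, "--format", "json", "--flavour", "ua1", pdf_path]


def run_verapdf(pdf_path: str, timeout: int = 300) -> Dict[str, Any]:
    """Run veraPDF if available and return its parsed JSON report."""
    binary = find_verapdf()
    if binary is None:
        return {"verapdf_run": False, "reason": "verapdf binary not found"}
    proc = subprocess.run(
        build_command(binary, pdf_path),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {
            "verapdf_run": False,
            "reason": "unparseable output",
            "stderr": proc.stderr[-500:],
        }
    return {"verapdf_run": True, "report": report}
```

The return code is deliberately not treated as an error here, since validators commonly exit non-zero when a document fails validation; that behaviour should also be confirmed for the installed version.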

Phase 5 — Print preflight — DONE on feature/axa-document-mode 2026-05-01

Pure-Python implementation. Original plan was Ghostscript-based; built deterministic PyMuPDF checks instead — same approach as Phase 4. Ghostscript can plug in later for total-ink-coverage / registration-black if scope grows.

Files added:

  • backend/document_mode/print_preflight_checks.py — 7 deterministic preflight criteria:
    • PP1 Page geometry consistency (single MediaBox size across all pages)
    • PP2 Bleed area defined (TrimBox/BleedBox differ from MediaBox)
    • PP3 Image colour spaces (flag DeviceRGB; press wants CMYK/Gray)
    • PP4 Image effective DPI (raw pixels / rendered inches; flag < 150)
    • PP5 Transparency / soft-mask usage (flag for flattening)
    • PP6 PDF/X conformance (XMP pdfxid:GTS_PDFXVersion)
    • PP7 Spot colour usage (flag /Separation, /DeviceN)
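
The PP4 arithmetic (raw pixels divided by rendered inches, flag below 150) can be written out as follows. The function names and the image dictionary keys are illustrative assumptions; the 72 points-per-inch conversion is the standard PDF unit.

```python
from typing import Any, Dict, List

POINTS_PER_INCH = 72.0


def effective_dpi(pixel_width: int, pixel_height: int,
                  rendered_width_pt: float, rendered_height_pt: float) -> float:
    """Effective resolution of a placed image: pixels over rendered inches.

    Returns the lower of the horizontal and vertical values, since a
    non-uniformly scaled image is limited by its weaker axis.
    """
    dpi_x = pixel_width / (rendered_width_pt / POINTS_PER_INCH)
    dpi_y = pixel_height / (rendered_height_pt / POINTS_PER_INCH)
    return min(dpi_x, dpi_y)


def flag_low_dpi(images: List[Dict[str, Any]], threshold: float = 150.0) -> List[Dict[str, Any]]:
    """Return the images whose effective DPI falls below the threshold."""
    flagged = []
    for img in images:
        dpi = effective_dpi(img["px_w"], img["px_h"], img["pt_w"], img["pt_h"])
        if dpi < threshold:
            flagged.append({"page": img["page"], "dpi": round(dpi, 1)})
    return flagged


# A 1200x900 px image placed at 4x3 inches (288x216 pt) renders at 300 DPI;
# a 300x300 px image placed at 4x4 inches renders at 75 DPI and is flagged.
example_flags = flag_low_dpi([
    {"page": 1, "px_w": 1200, "px_h": 900, "pt_w": 288.0, "pt_h": 216.0},
    {"page": 2, "px_w": 300, "px_h": 300, "pt_w": 288.0, "pt_h": 288.0},
])
```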

Files modified:

  • backend/document_mode/checks.py — axa_print_preflight registered.
  • backend/document_mode/result_writer.py — _render_print_preflight structured renderer with low-DPI image list, colour-space breakdown, spot-colour list, page-size detail.
  • backend/profiles/axa_policy_document.json — axa_print_preflight added at weight 1.0.

Smoke-test 2026-05-01 against Example 1 Home Insurance V8:

  • Print preflight: 5.71 / 10 (4 of 7 criteria pass) — correctly flags as digital-intent
    • ✓ PP1 — All 86 pages 210×297 mm (A4), consistent
    • ✗ PP2 — No bleed authored (digital intent — correct finding for an electronic policy)
    • ✓ PP3 — Only 1 grayscale image, no RGB
    • ✓ PP4 — Image renders at 279 DPI (above 150 threshold)
    • ✗ PP5 — 85 of 86 pages use transparency / soft-masks (Adobe InDesign default; would need flattening for press)
    • ✗ PP6 — No PDF/X conformance flag in XMP
    • ✓ PP7 — No spot colour spaces
  • Updated full-profile score: 78.25 / Pass (8 checks now)

Demo conversation: "If you're distributing electronically, this PDF is fine. If you're going to press, you need to (1) author bleed in InDesign, (2) flatten transparency on export, (3) declare PDF/X-1a or PDF/X-4 conformance."

All phases status (2026-05-01)

| Phase | Scope | Status | Cost | Wall |
| --- | --- | --- | --- | --- |
| 1 | Spine + 6 deterministic doc-scope checks | Done | $0 | seconds |
| 2 | Compliance variants (font/phone/bold lists) | Blocked on AXA | | |
| 3 | Old-vs-new diff (vision LLM page-pair) | Done | ~$0.50/run | ~3-5 min |
| 4 | PDF accessibility (PyMuPDF, veraPDF stub) | Done | $0 | seconds |
| 5 | Print preflight (PyMuPDF, Ghostscript later) | Done | $0 | seconds |

Demo-ready as of 2026-05-01. All work is on feature/axa-document-mode, local-only, no commits or pushes yet.

Things to flag before any further build

  1. 80 pages × multiple LLM checks = serious cost. A doc with the existing static checks running per-page would be ~$5-10 in Gemini/OpenAI calls. We should decide which LLM checks need per-page vs once over a sampled set. Most should be deterministic-only.
  2. veraPDF is Java. Adds a JRE dependency to GCP boxes.
  3. PDF mode breaks "one upload = one report" assumption. Decide what to save: full per-page JSON, summary only, or both. (Phase 1 saves both.)
  4. Reporting/billing. An 80-page doc is one analysis but 80× the LLM work. We should bill it as one analysis but track total checks separately. usage_tracker.log_analysis_complete already gets pages_processed in doc mode.

How doc-mode plugs into existing pipeline

For maintenance — the integration map:

| Doc-mode component | Reuses existing | Where |
| --- | --- | --- |
| Per-page check execution | process_checks_in_batches() | api_server.py:498 |
| Per-check dispatch | process_single_check() | api_server.py:377 |
| LLM call | run_visual_qc() | llm_config.py |
| Auth + client access | auth.require_auth, _require_client_access() | api_server.py:4883 |
| Progress polling | /api/progress/<id> | api_server.py:1695 |
| Output serving | /output/<client>/<filename> | api_server.py:2121 |
| Output listing | /api/output_files | api_server.py:2168 |
| Output folder | ensure_client_output_folder() | api_server.py:856 |
| Profile loading | profile_config.get_profile() | profile_config.py:219 |
| Profile visibility | client_config.get_profiles_with_visibility() | client_config.py:82 |
| Usage logging | usage_tracker.log_analysis_start/complete() | usage_tracker.py:73,100 |

Future deterministic doc-mode checks should follow _run_dj_file_naming_check() (api_server.py:348) — short-circuit at the top of process_single_check before any LLM dispatch.
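
The short-circuit pattern can be pictured with the sketch below. It does not reproduce process_single_check or _run_dj_file_naming_check; every name in it (the registry, the decorator, the example check, the llm_call parameter) is hypothetical and only illustrates returning a deterministic result before any LLM dispatch. The 200-page limit echoes the ingest page cap mentioned earlier.

```python
from typing import Any, Callable, Dict

CheckFn = Callable[[Dict[str, Any]], Dict[str, Any]]

# Hypothetical registry of deterministic checks keyed by check name.
DETERMINISTIC_CHECKS: Dict[str, CheckFn] = {}


def deterministic(name: str) -> Callable[[CheckFn], CheckFn]:
    """Decorator that registers a check as deterministic (no LLM needed)."""
    def register(fn: CheckFn) -> CheckFn:
        DETERMINISTIC_CHECKS[name] = fn
        return fn
    return register


@deterministic("example_page_count")
def _page_count_check(ingest_result: Dict[str, Any]) -> Dict[str, Any]:
    pages = ingest_result.get("page_count", 0)
    ok = 0 < pages <= 200
    return {
        "check_name": "example_page_count",
        "score": 10.0 if ok else 0.0,
        "pass": ok,
        "summary": f"{pages} pages",
    }


def process_single_check_sketch(check_name: str, ingest_result: Dict[str, Any],
                                llm_call: Callable[[str, Dict[str, Any]], Dict[str, Any]]) -> Dict[str, Any]:
    """Run a registered deterministic check first; fall through to the LLM otherwise."""
    handler = DETERMINISTIC_CHECKS.get(check_name)
    if handler is not None:
        return handler(ingest_result)  # short-circuit: no LLM cost
    return llm_call(check_name, ingest_result)
```

Passing the LLM call in as a parameter keeps the sketch testable and makes it obvious which path incurred cost.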