First real-data test against the AXA car-insurance PDFs surfaced a
noise problem: the new document is a brand refresh — every page flips
font (PublicoBanner-Bold→PublicoHeadline-Bold) and colour
(#893f4a→#2e3092). With each finding scored at medium severity, that
crashed the diff score to 0.0 and drowned out the bold-regression
signal AXA actually flagged.
Drop font, size, colour comparators. Keep bold + italic — the
attributes the vision-LLM consistently misses on dense layouts. The
LLM already narrates colour-scheme rebrands and font swaps in its
Modified / Style-changes blocks; running both layers on the same
visual change just double-counts it.
Tests are inverted from "X change is flagged" to "X change is NOT
flagged" to lock in the scope decision.
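
A minimal sketch of the narrowed scope, assuming spans are plain dicts
with optional "bold" / "italic" keys. The names COMPARED_ATTRS and
span_flips are illustrative, not the module's actual API:

```python
# Hypothetical sketch: only the boolean weight/slant attributes are compared.
# Font, size and colour are deliberately left out of the tuple.
COMPARED_ATTRS = ("bold", "italic")


def span_flips(old_span: dict, new_span: dict) -> list[tuple[str, bool, bool]]:
    """Return (attribute, old, new) for each compared attribute that changed."""
    flips = []
    for attr in COMPARED_ATTRS:
        old_val = bool(old_span.get(attr))
        new_val = bool(new_span.get(attr))
        if old_val != new_val:
            flips.append((attr, old_val, new_val))
    return flips


old = {"text": "Your cover", "font": "PublicoBanner-Bold",
       "color": "#893f4a", "bold": True}
rebrand = {"text": "Your cover", "font": "PublicoHeadline-Bold",
           "color": "#2e3092", "bold": True}
regression = {**rebrand, "bold": False}

print(span_flips(old, rebrand))     # [] : the rebrand alone is not flagged
print(span_flips(old, regression))  # [('bold', True, False)]
```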
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If the deterministic formatting comparator raises on any single page-pair
(e.g. unexpected span shape from a future PyMuPDF version), degrade to
zero formatting findings for that pair instead of aborting the whole
52-page diff run. The failure is logged for visibility.
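
Roughly, the guard looks like the sketch below. The wrapper name and the
injected compare callable are assumptions made for illustration; only
the "catch, log, return no findings" behaviour comes from this commit:

```python
import logging

logger = logging.getLogger("formatting_diff_sketch")


def safe_formatting_findings(compare, old_page, new_page, pair_index):
    """Run `compare` for one page-pair; degrade to [] if it raises."""
    try:
        return compare(old_page, new_page)
    except Exception:
        # Log with traceback, then carry on with the remaining pairs.
        logger.exception("formatting comparator failed on pair %d", pair_index)
        return []


def ok(old_page, new_page):
    return ["bold flip"]


def flaky(old_page, new_page):
    raise KeyError("flags")  # stand-in for an unexpected span shape


results = [
    safe_formatting_findings(fn, {}, {}, i)
    for i, fn in enumerate([ok, flaky, ok])
]
```

The middle pair yields an empty list while the pairs on either side
still report their findings.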
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously every boolean attribute rendered as "Bold → Regular",
producing "Italic: Bold → Regular" for italic flips. Now the helper
takes the attribute name and emits "Italic → Regular" or
"Bold → Regular" depending on which boolean attribute is being shown.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "🎨 Formatting changes" block to the per-page diff report
when the deterministic formatting layer finds typographic flips.
Distinguishes page-wide style shifts from local span flips, lists up
to three example quotes per aggregated finding, and HTML-escapes all
user-controlled strings.
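
A sketch of the rendering rule, assuming a finding carries a
description, a list of quotes and a page-wide flag. The function name,
the CSS class and the markup are placeholders; html.escape is the
standard-library call:

```python
from html import escape

MAX_EXAMPLES = 3  # "up to three example quotes per aggregated finding"


def render_finding(description: str, quotes: list[str], page_wide: bool) -> str:
    """Render one aggregated finding as an HTML fragment."""
    scope = "Page-wide" if page_wide else "Local"
    # Escape every string that originates from the PDF text.
    items = "".join(f"<li>{escape(q)}</li>" for q in quotes[:MAX_EXAMPLES])
    return (
        f'<div class="fmt-finding"><strong>{scope}:</strong> '
        f"{escape(description)}<ul>{items}</ul></div>"
    )


fragment = render_finding(
    "Bold → Regular",
    ["<b>Excess</b> applies", "Policy limits", "Claims & cover", "Fourth quote"],
    page_wide=False,
)
```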
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run_page_pair_diff now invokes compute_formatting_diff alongside the
LLM call for each aligned pair. When the deterministic layer finds
typographic flips on a page the LLM saw as identical, the pair is
re-classified as having differences with medium severity. Each
aggregated finding contributes to the global medium-severity tally.
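
The re-classification can be sketched as below. The result keys
("has_differences", "severity", "formatting_findings") and the tally
dict are assumed shapes, not the real run_page_pair_diff structures:

```python
def merge_pair_result(llm_result: dict, formatting_findings: list, tally: dict) -> dict:
    """Fold deterministic findings into the LLM verdict for one page-pair."""
    merged = dict(llm_result)  # leave the caller's dict untouched
    if formatting_findings:
        merged["formatting_findings"] = formatting_findings
        if not merged.get("has_differences"):
            # The LLM saw the pages as identical; the deterministic layer disagrees.
            merged["has_differences"] = True
            merged["severity"] = "medium"
        # Each aggregated finding counts once towards the medium tally.
        tally["medium"] = tally.get("medium", 0) + len(formatting_findings)
    return merged


tally = {}
identical = {"has_differences": False, "severity": None}
merged = merge_pair_result(identical, ["Bold → Regular (4 spans)"], tally)
untouched = merge_pair_result(identical, [], tally)
```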
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three review-driven hardening tweaks:
- page_wide now requires ≥3 matched spans (PAGE_WIDE_MIN_SPANS).
Avoids labelling section-break pages with a single flipped heading
as page-wide.
- _collect_flips normalises bold/italic via bool() and font/color
via "or ''" so callers passing dicts without those keys do not
produce phantom flips against False/''.
- Adds tests for empty span lists and the missing-bold-key case.
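
A sketch of the first two tweaks. PAGE_WIDE_MIN_SPANS is the constant
named above; the helper names and the "every matched span flipped"
definition of page-wide are assumptions made for illustration:

```python
PAGE_WIDE_MIN_SPANS = 3


def normalise(span: dict) -> dict:
    """Coerce missing or None attributes so they compare equal to False / ''."""
    return {
        "bold": bool(span.get("bold")),
        "italic": bool(span.get("italic")),
        "font": span.get("font") or "",
        "color": span.get("color") or "",
    }


def collect_flips(pairs):
    """pairs: list of (old_span, new_span) already matched by text."""
    flips = []
    for old, new in pairs:
        a, b = normalise(old), normalise(new)
        for attr in a:
            if a[attr] != b[attr]:
                flips.append((attr, a[attr], b[attr]))
    return flips


def is_page_wide(pairs, attr):
    """Assumed rule: enough matched spans, and every one of them flipped."""
    if len(pairs) < PAGE_WIDE_MIN_SPANS:
        return False
    return all(normalise(o)[attr] != normalise(n)[attr] for o, n in pairs)


heading = ({"text": "Section 2", "bold": True}, {"text": "Section 2"})
```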
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New formatting_diff module compares span-level bold/italic/font/size/
color attributes between aligned page-pairs. Pure-Python; reads
PyMuPDF metadata already captured during ingest. Aggregates identical
flips into single findings and flags page-wide style shifts.
Powers the AXA document_diff fix for missed formatting changes that
the vision-LLM does not reliably detect.
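
The aggregation step could look like the sketch below, assuming each
flip is an (attribute, old, new, text) tuple. The tuple layout and the
output keys are illustrative only:

```python
from collections import defaultdict


def aggregate_flips(flips):
    """Group identical (attribute, old, new) flips into single findings."""
    grouped = defaultdict(list)
    for attr, old, new, text in flips:
        grouped[(attr, old, new)].append(text)
    return [
        {"attribute": a, "old": o, "new": n,
         "count": len(texts), "examples": texts[:3]}
        for (a, o, n), texts in grouped.items()
    ]


findings = aggregate_flips([
    ("bold", True, False, "Your excess"),
    ("bold", True, False, "What is covered"),
    ("italic", False, True, "see section 4"),
    ("bold", True, False, "Making a claim"),
    ("bold", True, False, "Cancellation"),
])
```

Four bold flips collapse into one finding with a count of 4 and three
example quotes; the italic flip stays a separate finding.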
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 'color' field to each span dict extracted by
_extract_page_spans. Powers the upcoming deterministic
formatting-diff layer for AXA document_diff mode.
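
For context, PyMuPDF reports a span's colour as a packed sRGB integer
under the "color" key of the spans returned by page.get_text("dict").
The sketch below converts that integer to a hex string; whether
_extract_page_spans stores the raw integer or a hex string is not
stated here, so the helper names and the output shape are assumptions:

```python
def srgb_int_to_hex(color: int) -> str:
    """Convert a packed 0xRRGGBB integer to a '#rrggbb' string."""
    return f"#{color & 0xFFFFFF:06x}"


def span_record(span: dict) -> dict:
    """Build a span dict with a 'color' field from a PyMuPDF-style span."""
    return {
        "text": span.get("text", ""),
        "font": span.get("font", ""),
        "size": span.get("size"),
        "color": srgb_int_to_hex(span.get("color", 0)),
    }


record = span_record(
    {"text": "Car insurance", "font": "PublicoBanner-Bold",
     "size": 24.0, "color": 0x893F4A}
)
```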
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A 4-page Boots PPack run (7 page-scoped checks) was taking ~15 min
because the dispatcher processed pages sequentially within each
check: 28 Gemini calls (4 pages × 7 checks) issued one after another.
Asset-mode's ThreadPoolExecutor parallelism was bypassed because
doc-mode called process_checks_in_batches once per page in a loop.
Wrap the per-page dispatch in both Stage 3c (page_sample) and Stage
3d (page_each) with a ThreadPoolExecutor (max_workers=4). Extract
the per-page work into a single nested helper used by both stages,
which also tags each result with page_type so the existing artwork
vs informational aggregation in Stage 3d keeps working. Aggregation
logic, scoring, strict-grade override, and report shape are all
unchanged.
process_checks_in_batches is already reentrant (asset-mode uses it
under its own internal ThreadPoolExecutor), so concurrent calls are
safe. Progress-tracker writes intentionally tolerate races (visual
only). Per-page exceptions are caught inside the helper so one bad
page doesn't kill the doc — it just records a score-0 result.
Expected: 15 min → ~3-4 min on the same 4-page PDF. Needs wall-time
confirmation on dev with a real run.
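
A minimal sketch of the per-page fan-out, assuming the batch runner and
the page classifier are passed in as callables. run_pages and one_page
are hypothetical names, and the result dicts are a simplified shape:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4


def run_pages(pages, checks, run_checks, classify):
    """Run all checks for each page concurrently, preserving page order."""

    def one_page(page):
        page_type = classify(page)
        try:
            results = run_checks(page, checks)
        except Exception as exc:
            # One bad page records score-0 results instead of killing the doc.
            results = [{"check": c, "score": 0, "error": str(exc)} for c in checks]
        for result in results:
            result["page_type"] = page_type
        return results

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(one_page, pages))  # map() keeps input order


def fake_run_checks(page, checks):
    if page == 2:
        raise RuntimeError("model timeout")
    return [{"check": c, "score": 8} for c in checks]


out = run_pages([1, 2, 3], ["logo", "palette"], fake_run_checks, lambda p: "artwork")
```

Using pool.map rather than as_completed keeps results in page order, so
downstream aggregation does not need to re-sort.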
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AXA's accessibility QC team uses axes4 PAC (PDF/UA-1 / Matterhorn Protocol)
as their compliance gate, but our existing 9-criterion deterministic check
is surface-level only and would pass documents that PAC fails. Wired up the
existing _run_verapdf() stub so veraPDF — the open-source Matterhorn
implementation — runs as a subprocess and drives the score when available.
Verified locally: veraPDF on EAA_v1.pdf reports the exact same Content (86)
and Metadata (1) failure counts as PAC's report on the same document family,
confirming protocol parity.
Falls back cleanly to the deterministic layer when veraPDF isn't installed,
so deploys are safe before the binary lands on dev/prod servers.
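
The fallback can be sketched as below. Only shutil.which is a real
call; the two runner callables stand in for _run_verapdf() and the
deterministic check, and the binary name "verapdf" is an assumption
about how the tool is installed on PATH:

```python
import shutil


def accessibility_score(pdf_path, run_verapdf, run_deterministic, binary="verapdf"):
    """Use veraPDF when its binary is on PATH, else the deterministic layer."""
    if shutil.which(binary) is None:
        return run_deterministic(pdf_path)
    return run_verapdf(pdf_path)


layer = accessibility_score(
    "EAA_v1.pdf",
    run_verapdf=lambda path: "verapdf",
    run_deterministic=lambda path: "deterministic",
    binary="verapdf-binary-that-is-not-installed",
)
```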
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New profile boots_ppack for QCing multi-page Boots production packs
(PowerPoint-exported PDFs, 4-18 pages each). Built on top of AXA's
document-mode infrastructure — branched off feature/axa-document-mode
because it reuses the dispatcher, ingest, and result writer.
New checks:
- boots_logo_compliance — three-path scoring (master wordmark / partner
lock-up / no branding) so OLIVER x BOOTS-style footer lock-ups aren't
scored against master wordmark rules. Conservative without a formal
Boots logo guideline.
- boots_colour_palette — verifies CMYK/RGB/Hex spec values on creative-
guidance pages against canonical Boots Blue / Health Primary Blue /
Offer Red, plus visual sanity-check on artwork pages.
Existing checks tuned:
- boots_brand_name_accuracy: closed-world list semantics. Brands not on
the approved list now go to names_not_on_list (manual review) instead
of failing — the list is sourced from the original 7 docs and is known
incomplete (Remington, Imodium, Maybelline etc. are legitimate Boots-
stocked brands not on it).
- boots_tandc_wording: explicit font-weight caveat — Boots Sharp Regular
vs Light isn't reliably distinguishable by vision LLM at small sizes.
Surfaced via font_weight_caveat field + needs_manual_check value.
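
The closed-world change to boots_brand_name_accuracy can be sketched as
follows. names_not_on_list is the field named above; the function name,
the other keys, the case-insensitive match and the placeholder approved
list are assumptions:

```python
def classify_brand_names(found, approved):
    """Route unlisted brand names to manual review instead of failing them."""
    approved_lookup = {name.casefold() for name in approved}
    result = {"approved": [], "names_not_on_list": []}
    for name in found:
        key = "approved" if name.casefold() in approved_lookup else "names_not_on_list"
        result[key].append(name)
    # Unlisted names no longer lower the score; they only request review.
    result["manual_review"] = bool(result["names_not_on_list"])
    return result


# "Brand A" is a placeholder, not an entry from the real approved list.
outcome = classify_brand_names(["brand a", "Remington"], approved=["Brand A"])
```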
Page classifier (document_mode/page_classifier.py):
Heuristic tags each page as cover / checklist / palette / notes /
artwork. Validated on all 10 sample packs.
Strict-grade exemption (Profile.strict_grade flag):
Only artwork-classified pages count towards Pass/Fail. Cover, checklist,
palette, and notes pages are still QC'd and reported as Informational
but cannot trigger a Fail. Banner shows exactly which artwork-page
checks fell below 6.
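
A sketch of the exemption rule, assuming per-page results are dicts
with page_type, check and score keys. The function name and the result
shape are illustrative; the threshold of 6 and the artwork-only rule
come from the description above:

```python
STRICT_THRESHOLD = 6


def strict_grade(page_results):
    """Return ("Pass" | "Fail", violations); only artwork pages can fail."""
    violations = [
        r for r in page_results
        if r["page_type"] == "artwork" and r["score"] < STRICT_THRESHOLD
    ]
    return ("Fail" if violations else "Pass", violations)


pages = [
    {"page": 1, "page_type": "cover", "check": "boots_logo_compliance", "score": 2.0},
    {"page": 3, "page_type": "artwork", "check": "boots_tandc_wording", "score": 5.5},
    {"page": 4, "page_type": "artwork", "check": "boots_logo_compliance", "score": 9.5},
]
grade, violations = strict_grade(pages)
```

The low-scoring cover page is still reported but does not affect the
grade; only the artwork page below 6 appears in the violations list.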
Result writer extended:
- Per-page table with score + page_type pill for any page_each-scope
check (auto-applied as fallback)
- Strict-grade banner (red on violation, green when clean)
- Page_type pills throughout the per-page strip
Smoke-test result (Remington 4-page pack, 2026-05-05):
Overall 70.75/100, strict-grade Fail. After two iterations of prompt
tuning, all three remaining strict-grade violations are real catches:
orphan asterisk in T&Cs, "they may not be stocked" wording deviation,
missing "Charges may apply". brand_name_accuracy 7.0 (was 3.0 before
list fix), logo_compliance 9.5 (was 1.5 before lock-up path fix).
Local-only — not pushed to dev or merged to develop until after Boots
show-and-tell. Same posture as feature/axa-document-mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>