ai_qc

Author	SHA1	Message	Date
nickviljoen	29ee941037	refactor(formatting_diff): narrow scope to bold + italic only First real-data test against the AXA car-insurance PDFs surfaced a noise problem: the new document is a brand refresh — every page flips font (PublicoBanner-Bold→PublicoHeadline-Bold) and colour (#893f4a→#2e3092). At medium-per-finding that crashed the diff score to 0.0 and drowned the bold-regression signal AXA actually flagged. Drop font, size, colour comparators. Keep bold + italic — the attributes the vision-LLM consistently misses on dense layouts. The LLM already narrates colour-scheme rebrands and font swaps in its Modified / Style-changes blocks; running both layers on the same visual change just double-counts it. Tests inverted from "X change is flagged" to "X change is NOT flagged" to lock the scope decision in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 12:37:19 +02:00
nickviljoen	0fd6a35562	fix(diff_report): _fmt_value labels italic flips correctly Previously every boolean attribute rendered as "Bold → Regular", producing "Italic: Bold → Regular" for italic flips. Now the helper takes the attribute name and emits "Italic → Regular" or "Bold → Regular" depending on which boolean attribute is being shown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:22:39 +02:00
nickviljoen	7eaac85df3	feat(diff_report): render formatting_changes as a per-pair block Adds a "🎨 Formatting changes" block to the per-page diff report when the deterministic formatting layer finds typographic flips. Distinguishes page-wide style shifts from local span flips, lists up to three example quotes per aggregated finding, and HTML-escapes all user-controlled strings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:08:47 +02:00
nickviljoen	2b1bb9ccf0	feat(diff_engine): merge formatting_diff findings into pair_diffs run_page_pair_diff now invokes compute_formatting_diff alongside the LLM call for each aligned pair. When the deterministic layer finds typographic flips on a page the LLM saw as identical, the pair is re-classified as having differences with medium severity. Each aggregated finding contributes to the global medium-severity tally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:03:54 +02:00
nickviljoen	d21a8a276d	refactor(formatting_diff): harden page_wide threshold + None-key handling Three review-driven hardening tweaks: - page_wide now requires ≥3 matched spans (PAGE_WIDE_MIN_SPANS). Avoids labelling section-break pages with a single flipped heading as page-wide. - _collect_flips normalises bold/italic via bool() and font/color via "or ''" so callers passing dicts without those keys do not produce phantom flips against False/''. - Adds tests for empty span lists and the missing-bold-key case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:01:23 +02:00
nickviljoen	98679e7329	feat(document_mode): add deterministic span formatting diff New formatting_diff module compares span-level bold/italic/font/size/color attributes between aligned page-pairs. Pure-Python; reads PyMuPDF metadata already captured during ingest. Aggregates identical flips into single findings and flags page-wide style shifts. Powers the AXA document_diff fix for missed formatting changes that the vision-LLM does not reliably detect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:56:34 +02:00
nickviljoen	f69e181520	feat(ingest): capture span color as #rrggbb string Adds a 'color' field to each span dict extracted by _extract_page_spans. Powers the upcoming deterministic formatting-diff layer for AXA document_diff mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:45:21 +02:00
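
The 'color' field added in f69e181520 can be pictured with a small sketch. It assumes the span colour arrives as the packed sRGB integer that PyMuPDF reports in the spans of `page.get_text("dict")`. The helper name `srgb_int_to_hex` is hypothetical; the log does not show how `_extract_page_spans` does the conversion.

```python
def srgb_int_to_hex(color: int) -> str:
    """Format a packed sRGB integer (0xRRGGBB) as a lowercase '#rrggbb' string.

    Hypothetical helper for illustration only, not the code in _extract_page_spans.
    """
    # Mask to 24 bits so stray high bits cannot widen the output,
    # then zero-pad to six hex digits.
    return f"#{color & 0xFFFFFF:06x}"


# The two brand colours mentioned in 29ee941037, written as packed integers.
old_colour = srgb_int_to_hex(0x893F4A)
new_colour = srgb_int_to_hex(0x2E3092)
```

Storing the colour as a string keeps the span dict JSON-friendly and makes equality checks between two spans a plain string comparison.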

7 commits
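
Taken together, the commits describe a bold + italic comparator that normalises missing keys via bool(), aggregates identical flips, labels them per attribute, and marks a finding page-wide only above a minimum span count. The following is a minimal sketch of that behaviour, not the module's actual code. `PAGE_WIDE_MIN_SPANS` and its value of 3 come from d21a8a276d; the function names, the shape of the findings, and the rule that "page-wide" means every matched span flipped the same way are assumptions made for illustration.

```python
from collections import defaultdict

PAGE_WIDE_MIN_SPANS = 3  # constant named in d21a8a276d


def fmt_value(attr, value):
    """Label a boolean attribute value, e.g. 'Italic' or 'Regular' (hypothetical helper)."""
    return attr.capitalize() if value else "Regular"


def collect_flips(matched_pairs):
    """Aggregate bold/italic flips across matched (old_span, new_span) dict pairs.

    Font, size and colour are deliberately ignored, mirroring the narrowed scope
    described in 29ee941037.
    """
    groups = defaultdict(list)
    for old, new in matched_pairs:
        for attr in ("bold", "italic"):
            # bool() treats a missing key (None) the same as False,
            # so absent keys do not produce phantom flips.
            before, after = bool(old.get(attr)), bool(new.get(attr))
            if before != after:
                groups[(attr, before, after)].append(new.get("text") or "")

    total = len(matched_pairs)
    findings = []
    for (attr, before, after), quotes in groups.items():
        findings.append({
            "attribute": attr,
            "change": f"{fmt_value(attr, before)} → {fmt_value(attr, after)}",
            "count": len(quotes),
            # Assumed rule: page-wide only if enough spans matched and all flipped.
            "page_wide": total >= PAGE_WIDE_MIN_SPANS and len(quotes) == total,
            "examples": quotes[:3],  # the report lists up to three example quotes
        })
    return findings


# A lone flipped heading is a local finding, not a page-wide shift.
single = collect_flips([({"text": "Heading", "bold": True},
                         {"text": "Heading", "bold": False})])

# Three matched spans that all turn italic count as page-wide under the assumed rule.
page = collect_flips([({"text": t}, {"text": t, "italic": True}) for t in "abc"])
```

Keying the aggregation on (attribute, before, after) is what collapses many identical flips into one finding, which is the behaviour 98679e7329 describes.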