Commit graph

2 commits

Author SHA1 Message Date
nickviljoen
d21a8a276d refactor(formatting_diff): harden page_wide threshold + None-key handling
Three review-driven hardening tweaks:
- page_wide now requires ≥3 matched spans (PAGE_WIDE_MIN_SPANS).
  Avoids labelling section-break pages with a single flipped heading
  as page-wide.
- _collect_flips normalises bold/italic via bool() and font/color
  via "or ''" so callers passing dicts without those keys do not
  produce phantom flips against False/''.
- Adds tests for empty span lists and the missing-bold-key case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:01:23 +02:00
nickviljoen
98679e7329 feat(document_mode): add deterministic span formatting diff
New formatting_diff module compares span-level bold/italic/font/size/
color attributes between aligned page-pairs. Pure-Python; reads
PyMuPDF metadata already captured during ingest. Aggregates identical
flips into single findings and flags page-wide style shifts.

Powers the AXA document_diff fix for missed formatting changes that
the vision-LLM does not reliably detect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:56:34 +02:00