refactor(formatting_diff): narrow scope to bold + italic only

First real-data test against the AXA car-insurance PDFs surfaced a noise problem: the new document is a brand refresh, so every page flips font (PublicoBanner-Bold→PublicoHeadline-Bold) and colour (#893f4a→#2e3092). With each finding scored as medium, that crashed the diff score to 0.0 and drowned the bold-regression signal AXA actually flagged.

Drop font, size, colour comparators. Keep bold + italic, the attributes the vision-LLM consistently misses on dense layouts. The LLM already narrates colour-scheme rebrands and font swaps in its Modified / Style-changes blocks; running both layers on the same visual change just double-counts it.

Tests inverted from "X change is flagged" to "X change is NOT flagged" to lock the scope decision in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(diff_engine): guard compute_formatting_diff against per-pair failure
2026-05-19 12:37:19 +02:00 · 2026-05-19 10:31:16 +02:00 · 2026-05-19 10:23:15 +02:00 · 2026-05-19 10:22:39 +02:00 · 2026-05-19 10:08:47 +02:00 · 2026-05-19 10:03:54 +02:00
11 changed files with 1921 additions and 3 deletions
--- a/CLAUDE_AXA.md
+++ b/CLAUDE_AXA.md
@@ -53,14 +53,14 @@ Boots Production Pack reuses this entire spine — so any infra changes here aff

 ## AI usage across AXA tools

-For client-facing context: **8 of 9 AXA tools are deterministic** (no LLM, $0 cost, runs in seconds). Only `axa_pdf_diff` uses AI — Gemini 2.5 Pro vision-LLM page-pair comparison at ~$0.40-0.80 per pair. The accessibility check uses veraPDF, which is a rule-based open-source PDF/UA-1 validator — not AI. This framing matters when clients conflate "automation" with "AI".
+For client-facing context: **8 of 9 AXA tools are deterministic** (no LLM, $0 cost, runs in seconds). Only `axa_pdf_diff` uses AI — Gemini 2.5 Pro vision-LLM page-pair comparison at ~$0.40-0.80 per pair, supplemented by a deterministic PyMuPDF span comparator that catches bold/italic flips the vision-LLM misses (font/size/colour changes are left to the LLM narrative diff — flagging them deterministically drowns out the bold/italic regressions on re-branded documents). The accessibility check uses veraPDF, which is a rule-based open-source PDF/UA-1 validator — not AI. This framing matters when clients conflate "automation" with "AI".

 | Tool | Type | Engine |
 |---|---|---|
 | `axa_font_inventory`, `axa_phone_inventory`, `axa_bold_words_definitions`, `axa_page_numbering`, `axa_print_code`, `axa_omg_versioning` | Deterministic | PyMuPDF (text + font extraction, regex) |
 | `axa_print_preflight` | Deterministic | PyMuPDF (page geometry, image colour spaces, DPI, transparency, PDF/X) |
 | `axa_pdf_accessibility` | Deterministic (rule-based) | veraPDF subprocess (PDF/UA-1 / Matterhorn Protocol) + PyMuPDF fallback |
-| `axa_pdf_diff` | **AI** | Gemini 2.5 Pro vision-LLM, page-pair diff |
+| `axa_pdf_diff` | **AI + deterministic** | Gemini 2.5 Pro vision-LLM (content + font/size/colour narrative) + PyMuPDF span comparator (bold/italic flip detection) |

 ## Open items

--- a/backend/document_mode/diff_engine.py
+++ b/backend/document_mode/diff_engine.py
@@ -26,6 +26,8 @@ from typing import Dict, List, Optional, Tuple

 from PIL import Image

+from document_mode.formatting_diff import compute_formatting_diff
+

 # Similarity threshold for considering two pages "the same page modified"
 # vs "an inserted/removed page". Tuned for policy docs where page-level text
@@ -311,6 +313,31 @@ def run_page_pair_diff(
        if not old_p or not new_p or not old_p.get('image_path') or not new_p.get('image_path'):
            return entry, None
        result = _diff_one_pair(old_p, new_p, call_gemini_vision_fn, model_version)
+
+        # Deterministic formatting diff — runs alongside the LLM diff.
+        # Guard so a single bad span on one page doesn't abort the whole run.
+        try:
+            fmt = compute_formatting_diff(
+                old_p.get('spans') or [],
+                new_p.get('spans') or [],
+                old_p['page_num'],
+                new_p['page_num'],
+            )
+        except Exception as fmt_err:
+            print(f"  [formatting_diff] page {old_p['page_num']}->{new_p['page_num']} failed: {fmt_err}")
+            fmt = {'formatting_changes': [], 'finding_count': 0}
+        diff = result.setdefault('diff', {})
+        diff['formatting_changes'] = fmt['formatting_changes']
+        if fmt['finding_count'] > 0:
+            # If the LLM saw the page as identical but the deterministic
+            # layer found typographic flips, we still need the report to
+            # render the pair as "has changes".
+            diff['differences_found'] = True
+            # Each aggregated finding contributes one medium severity entry.
+            # If the LLM reported no severity, raise the pair to 'medium' so
+            # the pair-card pill reflects the findings.
+            if diff.get('severity') in (None, 'none'):
+                diff['severity'] = 'medium'
        return entry, result

    with concurrent.futures.ThreadPoolExecutor(max_workers=parallel_pairs) as pool:
@@ -345,6 +372,16 @@ def run_page_pair_diff(
        sev = d['diff'].get('severity') or 'none'
        if sev in severity_counts:
            severity_counts[sev] += 1
+        # Each formatting-change finding counts as one medium entry. The
+        # pair's base severity, already counted above, stands in for the
+        # first finding, so a pair with N findings adds N-1 mediums here.
+        fmt_findings = d['diff'].get('formatting_changes') or []
+        if fmt_findings:
+            # _run raises a 'none' base severity to 'medium' when findings
+            # exist; an LLM-assigned severity is left as reported and
+            # absorbs one finding in this tally.
+            extra = max(0, len(fmt_findings) - 1)
+            severity_counts['medium'] += extra

    return {
        'alignment': alignment,
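
Note on the severity tally in the hunk above: the counting rule can be checked in isolation. The sketch below is a standalone model, not the code in `diff_engine.py`. The function name `count_severities`, the flat list of diff dicts, and the `low` / `high` keys are assumptions made for illustration.

```python
from typing import Dict, List


def count_severities(pair_diffs: List[Dict]) -> Dict[str, int]:
    """Tally severities using the same rule as the hunk above (model only)."""
    # The set of severity keys is an assumption; the hunk only shows that
    # unknown severities are skipped.
    counts = {'none': 0, 'low': 0, 'medium': 0, 'high': 0}
    for diff in pair_diffs:
        sev = diff.get('severity') or 'none'
        if sev in counts:
            counts[sev] += 1
        findings = diff.get('formatting_changes') or []
        # The base severity stands in for the first finding; the remaining
        # findings are added as extra mediums.
        counts['medium'] += max(0, len(findings) - 1)
    return counts
```

Under this rule a pair with base severity `medium` and three findings yields three mediums in total. A pair the LLM rated `high` with two findings yields one high and one medium, so one finding is absorbed by the LLM's own rating.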
--- a/backend/document_mode/diff_report_writer.py
+++ b/backend/document_mode/diff_report_writer.py
@@ -75,6 +75,48 @@ def _render_diff_list(items: List[str], css_class: str, label: str, icon: str) -
    """


+def _render_formatting_block(findings: List[Dict]) -> str:
+    if not findings:
+        return ''
+
+    def _fmt_value(v, attribute):
+        if isinstance(v, bool):
+            if attribute == 'italic':
+                return 'Italic' if v else 'Regular'
+            return 'Bold' if v else 'Regular'
+        return str(v)
+
+    items = []
+    for f in findings:
+        attr = f.get('attribute', '')
+        old_v = _fmt_value(f.get('old_value'), attr)
+        new_v = _fmt_value(f.get('new_value'), attr)
+        total = f.get('total_span_count', 0)
+        page_wide = f.get('page_wide', False)
+        quotes = f.get('example_quotes', []) or []
+
+        if page_wide:
+            prefix = f"<strong>Page-wide {html.escape(attr)} change</strong>: {html.escape(old_v)} → {html.escape(new_v)}"
+        else:
+            prefix = f"<strong>{html.escape(attr).capitalize()}: {html.escape(old_v)} → {html.escape(new_v)}</strong>"
+
+        quote_html = ''
+        if quotes:
+            quoted = ', '.join(f'&ldquo;{html.escape(q)}&rdquo;' for q in quotes[:3])
+            extra = total - len(quotes[:3])
+            extra_html = f" <span class='muted'>…and {extra} more</span>" if extra > 0 else ''
+            quote_html = f" ({total} span{'s' if total != 1 else ''}): {quoted}{extra_html}"
+
+        items.append(f"<li>{prefix}{quote_html}</li>")
+
+    return f"""
+    <div class='diff-block block-style'>
+        <div class='diff-label'>🎨 Formatting changes</div>
+        <ul>{''.join(items)}</ul>
+    </div>
+    """
+
+
 def _render_pair_card(entry: Dict, pair_diffs: Dict) -> str:
    old = entry['old_page']
    new = entry['new_page']
@@ -132,6 +174,7 @@ def _render_pair_card(entry: Dict, pair_diffs: Dict) -> str:
    blocks.append(_render_diff_list(pair.get('modified') or [], 'block-modified', 'Modified', '✎'))
    blocks.append(_render_diff_list(pair.get('moved') or [], 'block-moved', 'Moved', '↔'))
    blocks.append(_render_diff_list(pair.get('style_changes') or [], 'block-style', 'Style changes', '🎨'))
+    blocks.append(_render_formatting_block(pair.get('formatting_changes') or []))

    error_block = ''
    if pair.get('error'):
--- a/backend/document_mode/formatting_diff.py
+++ b/backend/document_mode/formatting_diff.py
@@ -0,0 +1,168 @@
+"""Deterministic span-level formatting diff for one aligned page-pair.
+
+Companion to diff_engine's vision-LLM diff. The LLM is reliable for
+content/narrative changes (added paragraphs, rewords, layout shifts,
+colour-scheme rebrands) but unreliable for bold/italic flips on dense
+typeset layouts — which is exactly what AXA flagged was being missed.
+
+Scope intentionally narrow: bold + italic flips only. Font / size /
+colour are NOT compared here. A re-export from a different toolchain
+(or a brand refresh) routinely flips font names and colour values on
+every page; reporting those as per-page deductions drowns out the
+bold/italic regressions clients actually need to spot. The LLM
+narrates those rebrand changes already.
+
+Public surface: compute_formatting_diff(old_spans, new_spans,
+old_page_num, new_page_num) -> dict.
+"""
+
+from __future__ import annotations
+
+from collections import defaultdict
+from typing import Dict, List, Tuple
+
+# Spans shorter than this (after .strip()) are ignored. "the", "of", "1",
+# "." are too common to match reliably and would produce noise.
+MIN_TEXT_LEN = 4
+
+# Number of example quotes per aggregated finding shown in the report.
+MAX_QUOTES_PER_FINDING = 3
+
+# A finding only qualifies as "page-wide" if the page has enough matched
+# spans to make that statement meaningful. Section-break pages with one
+# or two long spans should not be labelled page-wide on a single flip.
+PAGE_WIDE_MIN_SPANS = 3
+
+
+def compute_formatting_diff(
+    old_spans: List[Dict],
+    new_spans: List[Dict],
+    old_page_num: int,
+    new_page_num: int,
+) -> Dict:
+    """Compare two span lists and return aggregated formatting changes.
+
+    Scope intentionally limited to bold + italic flips. Font, size and
+    colour changes (rebrands, re-exports from a different toolchain) are
+    handled by the vision-LLM's narrative diff — re-flagging them here
+    drowns out the bold/italic regressions clients actually care about.
+
+    Returns:
+        {
+            'formatting_changes': [
+                {
+                    'attribute': 'bold' | 'italic',
+                    'old_value': bool,
+                    'new_value': bool,
+                    'example_quotes': [str, ...],
+                    'total_span_count': int,
+                    'page_wide': bool,
+                },
+                ...
+            ],
+            'finding_count': int,
+            'severity': 'medium' | 'none',
+            'old_page_num': int,
+            'new_page_num': int,
+        }
+    """
+    pairs = _match_spans(old_spans, new_spans)
+    matched_count = len(pairs)
+    flips = _collect_flips(pairs)
+    findings = _aggregate(flips, matched_count)
+
+    return {
+        'formatting_changes': findings,
+        'finding_count': len(findings),
+        'severity': 'medium' if findings else 'none',
+        'old_page_num': old_page_num,
+        'new_page_num': new_page_num,
+    }
+
+
+def _match_spans(old_spans: List[Dict], new_spans: List[Dict]) -> List[Tuple[Dict, Dict]]:
+    """Pair spans across pages by exact-text match, disambiguated by y-position.
+
+    Spans with fewer than MIN_TEXT_LEN chars after stripping are skipped.
+    Returns a list of (old_span, new_span) tuples.
+    """
+    new_by_text: Dict[str, List[Dict]] = defaultdict(list)
+    for s in new_spans:
+        text = (s.get('text') or '').strip()
+        if len(text) < MIN_TEXT_LEN:
+            continue
+        new_by_text[text].append(s)
+
+    pairs: List[Tuple[Dict, Dict]] = []
+    consumed: set = set()
+    for old_span in old_spans:
+        text = (old_span.get('text') or '').strip()
+        if len(text) < MIN_TEXT_LEN:
+            continue
+        candidates = [c for c in new_by_text.get(text, []) if id(c) not in consumed]
+        if not candidates:
+            continue
+        if len(candidates) == 1:
+            chosen = candidates[0]
+        else:
+            chosen = min(candidates, key=lambda c: abs(_y_mid(c) - _y_mid(old_span)))
+        consumed.add(id(chosen))
+        pairs.append((old_span, chosen))
+
+    return pairs
+
+
+def _y_mid(span: Dict) -> float:
+    """Vertical midpoint of a span's bbox; 0.0 if bbox is missing."""
+    bbox = span.get('bbox') or (0, 0, 0, 0)
+    return (bbox[1] + bbox[3]) / 2.0
+
+
+def _collect_flips(pairs: List[Tuple[Dict, Dict]]) -> List[Dict]:
+    """For each paired span, emit a flip record per bold/italic change."""
+    flips: List[Dict] = []
+    for old_span, new_span in pairs:
+        text = (old_span.get('text') or '').strip()
+        for attr in ('bold', 'italic'):
+            old_v = bool(old_span.get(attr))
+            new_v = bool(new_span.get(attr))
+            if old_v != new_v:
+                flips.append({
+                    'attribute': attr, 'old_value': old_v,
+                    'new_value': new_v, 'quote': text,
+                })
+    return flips
+
+
+def _aggregate(flips: List[Dict], matched_span_count: int) -> List[Dict]:
+    """Group flips by (attribute, old_value, new_value) and emit one finding per group."""
+    groups: Dict[Tuple, List[Dict]] = defaultdict(list)
+    for f in flips:
+        key = (f['attribute'], _hashable(f['old_value']), _hashable(f['new_value']))
+        groups[key].append(f)
+
+    findings: List[Dict] = []
+    for (attribute, _, _), members in groups.items():
+        old_v = members[0]['old_value']
+        new_v = members[0]['new_value']
+        quotes = [m['quote'] for m in members[:MAX_QUOTES_PER_FINDING]]
+        total = len(members)
+        page_wide = matched_span_count >= PAGE_WIDE_MIN_SPANS and total == matched_span_count
+        findings.append({
+            'attribute': attribute,
+            'old_value': old_v,
+            'new_value': new_v,
+            'example_quotes': quotes,
+            'total_span_count': total,
+            'page_wide': page_wide,
+        })
+
+    findings.sort(key=lambda f: -f['total_span_count'])
+    return findings
+
+
+def _hashable(v):
+    """Coerce a value to a hashable form for groupby keys (floats already are)."""
+    if isinstance(v, (str, int, float, bool)) or v is None:
+        return v
+    return str(v)
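
Note on `formatting_diff.py`: the effect of the narrowed scope is easiest to see on a rebrand-style page. The sketch below is a condensed, standalone re-statement of the match-then-compare idea, not the module itself. `bold_italic_flips` and the sample span values are hypothetical, and the aggregation and page-wide logic are left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MIN_TEXT_LEN = 4  # same threshold as the module above


def y_mid(span: Dict) -> float:
    """Vertical midpoint of the span's bbox, 0.0 when the bbox is missing."""
    bbox = span.get('bbox') or (0, 0, 0, 0)
    return (bbox[1] + bbox[3]) / 2.0


def bold_italic_flips(old_spans: List[Dict], new_spans: List[Dict]) -> List[Tuple]:
    """Pair spans by exact text (nearest y on ties) and report bold/italic flips."""
    by_text = defaultdict(list)
    for span in new_spans:
        text = (span.get('text') or '').strip()
        if len(text) >= MIN_TEXT_LEN:
            by_text[text].append(span)

    flips = []
    for old in old_spans:
        text = (old.get('text') or '').strip()
        candidates = by_text.get(text, [])
        if len(text) < MIN_TEXT_LEN or not candidates:
            continue
        # Pick the closest candidate vertically and consume it so it cannot
        # be matched twice.
        idx = min(range(len(candidates)),
                  key=lambda i: abs(y_mid(candidates[i]) - y_mid(old)))
        new = candidates.pop(idx)
        # Font, size and colour are deliberately not compared.
        for attr in ('bold', 'italic'):
            old_v, new_v = bool(old.get(attr)), bool(new.get(attr))
            if old_v != new_v:
                flips.append((text, attr, old_v, new_v))
    return flips


# Made-up sample page: every span changes font and colour, one loses bold.
old_page = [
    {'text': 'Theft of personal belongings', 'bold': True,
     'font': 'AXASans-Regular', 'color': '#893f4a', 'bbox': (72, 100, 300, 115)},
    {'text': 'Section heading', 'bold': True,
     'font': 'AXASans-Regular', 'color': '#893f4a', 'bbox': (72, 40, 300, 60)},
]
new_page = [
    {'text': 'Theft of personal belongings', 'bold': False,
     'font': 'Helvetica', 'color': '#2e3092', 'bbox': (72, 100, 300, 115)},
    {'text': 'Section heading', 'bold': True,
     'font': 'Helvetica', 'color': '#2e3092', 'bbox': (72, 40, 300, 60)},
]
print(bold_italic_flips(old_page, new_page))
```

On this sample only the span that lost bold is reported. The font and colour changes on both spans produce nothing, which is the behaviour the inverted tests lock in.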
--- a/backend/document_mode/ingest.py
+++ b/backend/document_mode/ingest.py
@@ -55,6 +55,7 @@ def _extract_page_spans(page: fitz.Page) -> List[Dict]:
                text = (span.get('text') or '').strip()
                if not text:
                    continue
+                color_int = span.get('color', 0) or 0
                spans.append({
                    'text': text,
                    'font': span.get('font'),
@@ -63,6 +64,7 @@ def _extract_page_spans(page: fitz.Page) -> List[Dict]:
                    'italic': _span_is_italic(span),
                    'bbox': span.get('bbox'),  # (x0, y0, x1, y1) in PDF points
                    'flags': span.get('flags', 0),
+                    'color': f'#{color_int & 0xFFFFFF:06x}',
                })
    return spans

@@ -117,7 +119,7 @@ def ingest_pdf(
                    'page_num': 1-indexed int,
                    'image_path': str,
                    'raw_text': str,
-                    'spans': [{ text, font, size, bold, italic, bbox, flags }, ...],
+                    'spans': [{ text, font, size, bold, italic, color, bbox, flags }, ...],
                    'fonts_used': sorted list of unique font names,
                },
                ...
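
Note on the colour field added in `ingest.py`: PyMuPDF reports a span's colour as a packed sRGB integer, and the hunk above formats it as a `#rrggbb` string. The conversion can be sketched on its own as follows. `color_int_to_hex` is a hypothetical helper name used only here; `ingest.py` does the formatting inline.

```python
from typing import Optional


def color_int_to_hex(color_int: Optional[int]) -> str:
    """Format a packed 0xRRGGBB integer as '#rrggbb'; None or 0 becomes black."""
    # Mask to 24 bits so any higher bits cannot widen the string.
    value = (color_int or 0) & 0xFFFFFF
    return f'#{value:06x}'
```

The `:06x` format spec pads with leading zeros, so a dark colour such as `0x000a1f` still comes out as a seven-character string.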
--- a/backend/tests/test_diff_engine_formatting_integration.py
+++ b/backend/tests/test_diff_engine_formatting_integration.py
@@ -0,0 +1,62 @@
+"""Smoke test: run_page_pair_diff merges formatting findings into pair_diffs."""
+
+import pytest
+
+from document_mode.diff_engine import run_page_pair_diff
+
+
+def _page(page_num, raw_text, spans, image_path='/tmp/dummy.png'):
+    return {
+        'page_num': page_num,
+        'raw_text': raw_text,
+        'spans': spans,
+        'image_path': image_path,
+        'fonts_used': [],
+    }
+
+
+def _span(text, bold=False):
+    return {'text': text, 'bold': bold, 'italic': False, 'font': 'Helvetica',
+            'size': 10.0, 'color': '#000000', 'bbox': (0, 10, 100, 22)}
+
+
+def test_formatting_findings_surface_when_llm_returns_identical(tmp_path):
+    # Create real dummy PNGs since _diff_one_pair tries to open them via PIL.
+    from PIL import Image as PILImage
+    img_path = tmp_path / "dummy.png"
+    PILImage.new('RGB', (10, 10)).save(img_path)
+
+    old_pages = [_page(
+        1,
+        "Theft of personal belongings if your car is left unattended unless windows are closed.",
+        [_span("Theft of personal belongings if your car is left unattended", bold=True)],
+        image_path=str(img_path),
+    )]
+    new_pages = [_page(
+        1,
+        "Theft of personal belongings if your car is left unattended unless windows are closed.",
+        [_span("Theft of personal belongings if your car is left unattended", bold=False)],
+        image_path=str(img_path),
+    )]
+
+    # LLM says: no differences. We expect the deterministic layer to override.
+    def fake_llm(prompt, old_img, new_img, model_version=None):
+        return (
+            '{"differences_found": false, "added": [], "removed": [], '
+            '"modified": [], "moved": [], "style_changes": [], '
+            '"severity": "none", "summary": "Identical."}',
+            {'prompt_tokens': 100, 'completion_tokens': 20, 'total_tokens': 120},
+        )
+
+    result = run_page_pair_diff(
+        old_ingest={'pages': old_pages},
+        new_ingest={'pages': new_pages},
+        call_gemini_vision_fn=fake_llm,
+    )
+
+    pair_diff = result['pair_diffs']['1->1']['diff']
+    assert pair_diff['differences_found'] is True
+    assert pair_diff['severity'] == 'medium'
+    assert len(pair_diff['formatting_changes']) == 1
+    assert pair_diff['formatting_changes'][0]['attribute'] == 'bold'
+    assert result['totals']['severity_counts']['medium'] >= 1
--- a/backend/tests/test_diff_report_formatting_block.py
+++ b/backend/tests/test_diff_report_formatting_block.py
@@ -0,0 +1,81 @@
+"""Smoke test for the new formatting-changes rendering block."""
+
+from document_mode.diff_report_writer import _render_formatting_block
+
+
+def test_empty_findings_render_nothing():
+    assert _render_formatting_block([]) == ''
+
+
+def test_single_bold_flip_renders_with_quote():
+    findings = [{
+        'attribute': 'bold',
+        'old_value': True,
+        'new_value': False,
+        'example_quotes': ['Theft of personal belongings'],
+        'total_span_count': 1,
+        'page_wide': False,
+    }]
+    html_out = _render_formatting_block(findings)
+    assert '🎨 Formatting changes' in html_out
+    assert 'Theft of personal belongings' in html_out
+    assert 'Bold' in html_out
+    assert 'Regular' in html_out
+    assert 'block-style' in html_out
+
+
+def test_page_wide_flag_changes_label():
+    findings = [{
+        'attribute': 'font',
+        'old_value': 'AXASans-Regular',
+        'new_value': 'Helvetica',
+        'example_quotes': ['Some body text'],
+        'total_span_count': 17,
+        'page_wide': True,
+    }]
+    html_out = _render_formatting_block(findings)
+    assert 'Page-wide font change' in html_out
+
+
+def test_html_escape_in_quotes():
+    findings = [{
+        'attribute': 'bold',
+        'old_value': True,
+        'new_value': False,
+        'example_quotes': ['<script>alert("xss")</script>'],
+        'total_span_count': 1,
+        'page_wide': False,
+    }]
+    html_out = _render_formatting_block(findings)
+    assert '<script>' not in html_out
+    assert '&lt;script&gt;' in html_out
+
+
+def test_aggregated_finding_shows_and_x_more():
+    findings = [{
+        'attribute': 'bold',
+        'old_value': True,
+        'new_value': False,
+        'example_quotes': ['First quote', 'Second quote', 'Third quote'],
+        'total_span_count': 12,
+        'page_wide': False,
+    }]
+    html_out = _render_formatting_block(findings)
+    assert '12 spans' in html_out
+    assert 'and 9 more' in html_out
+
+
+def test_italic_flip_uses_italic_label_not_bold():
+    findings = [{
+        'attribute': 'italic',
+        'old_value': True,
+        'new_value': False,
+        'example_quotes': ['Block quote text'],
+        'total_span_count': 1,
+        'page_wide': False,
+    }]
+    html_out = _render_formatting_block(findings)
+    assert 'Italic' in html_out
+    assert 'Regular' in html_out
+    # Critical: the label "Bold" must NOT appear in an italic flip
+    assert 'Bold' not in html_out
--- a/backend/tests/test_formatting_diff.py
+++ b/backend/tests/test_formatting_diff.py
@@ -0,0 +1,210 @@
+"""Unit tests for deterministic per-page-pair formatting diff."""
+
+import pytest
+
+from document_mode.formatting_diff import compute_formatting_diff
+
+
+def _span(text, bold=False, italic=False, font='Helvetica', size=10.0,
+          color='#000000', bbox=(0, 10, 100, 22)):
+    return {
+        'text': text, 'bold': bold, 'italic': italic, 'font': font,
+        'size': size, 'color': color, 'bbox': bbox,
+    }
+
+
+def test_identical_spans_produce_no_findings():
+    spans_a = [_span("Hello world"), _span("Second paragraph")]
+    spans_b = [_span("Hello world"), _span("Second paragraph")]
+
+    result = compute_formatting_diff(spans_a, spans_b, 1, 1)
+
+    assert result['finding_count'] == 0
+    assert result['formatting_changes'] == []
+    assert result['severity'] == 'none'
+
+
+def test_bold_flip_is_detected():
+    spans_a = [_span("Theft of personal belongings", bold=True)]
+    spans_b = [_span("Theft of personal belongings", bold=False)]
+
+    result = compute_formatting_diff(spans_a, spans_b, 18, 18)
+
+    assert result['finding_count'] == 1
+    finding = result['formatting_changes'][0]
+    assert finding['attribute'] == 'bold'
+    assert finding['old_value'] is True
+    assert finding['new_value'] is False
+    assert finding['total_span_count'] == 1
+    assert "Theft of personal belongings" in finding['example_quotes']
+    assert result['severity'] == 'medium'
+
+
+def test_aggregates_identical_flips_into_one_finding():
+    old = [
+        _span("First sentence that lost bold", bold=True),
+        _span("Second sentence that lost bold", bold=True),
+        _span("Third sentence that lost bold", bold=True),
+    ]
+    new = [
+        _span("First sentence that lost bold", bold=False),
+        _span("Second sentence that lost bold", bold=False),
+        _span("Third sentence that lost bold", bold=False),
+    ]
+
+    result = compute_formatting_diff(old, new, 22, 22)
+
+    assert result['finding_count'] == 1
+    finding = result['formatting_changes'][0]
+    assert finding['total_span_count'] == 3
+    assert len(finding['example_quotes']) == 3
+    assert finding['page_wide'] is True
+
+
+def test_page_wide_flag_false_when_only_subset_flips():
+    old = [
+        _span("Lost its bold", bold=True),
+        _span("Stays regular and matches text", bold=False),
+    ]
+    new = [
+        _span("Lost its bold", bold=False),
+        _span("Stays regular and matches text", bold=False),
+    ]
+
+    result = compute_formatting_diff(old, new, 5, 5)
+
+    assert result['finding_count'] == 1
+    assert result['formatting_changes'][0]['page_wide'] is False
+
+
+def test_short_text_spans_are_ignored():
+    old = [_span("of", bold=True), _span("the", bold=True)]
+    new = [_span("of", bold=False), _span("the", bold=False)]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 0
+
+
+def test_unmatched_text_is_ignored_not_flagged():
+    old = [_span("Original sentence that was bold", bold=True)]
+    new = [_span("Completely different replacement copy", bold=False)]
+
+    result = compute_formatting_diff(old, new, 7, 7)
+
+    assert result['finding_count'] == 0
+
+
+def test_size_change_not_flagged():
+    # Size is intentionally out of scope — rebrand re-exports often change
+    # body-text point sizes by fractions of a point.
+    old = [_span("Body text resized", size=10.00)]
+    new = [_span("Body text resized", size=12.50)]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 0
+
+
+def test_font_change_not_flagged():
+    # Font swap is intentionally out of scope — caught by the LLM narrative
+    # diff. Reporting it here would drown out bold/italic regressions on
+    # re-branded documents.
+    old = [_span("Body text in original font face", font='AXASans-Regular')]
+    new = [_span("Body text in original font face", font='Helvetica')]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 0
+
+
+def test_color_change_not_flagged():
+    # Colour is intentionally out of scope for the same rebrand-noise reason.
+    old = [_span("Hyperlink-style text in blue", color='#0066cc')]
+    new = [_span("Hyperlink-style text in blue", color='#000000')]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 0
+
+
+def test_italic_flip_detected():
+    old = [_span("Block quote that was italicised", italic=True)]
+    new = [_span("Block quote that was italicised", italic=False)]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 1
+    assert result['formatting_changes'][0]['attribute'] == 'italic'
+
+
+def test_duplicate_text_disambiguated_by_y_position():
+    old = [
+        _span("Important note", bold=True, bbox=(72, 100, 200, 115)),
+        _span("Important note", bold=True, bbox=(72, 700, 200, 715)),
+    ]
+    new = [
+        _span("Important note", bold=False, bbox=(72, 100, 200, 115)),
+        _span("Important note", bold=True, bbox=(72, 700, 200, 715)),
+    ]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 1
+    assert result['formatting_changes'][0]['total_span_count'] == 1
+
+
+def test_single_span_page_not_labelled_page_wide():
+    # A page with only one matched span that flipped should NOT be page-wide,
+    # even though "all" matched spans flipped — the count is too small.
+    old = [_span("Sole heading on this section-break page", bold=True)]
+    new = [_span("Sole heading on this section-break page", bold=False)]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 1
+    assert result['formatting_changes'][0]['page_wide'] is False
+
+
+def test_two_span_page_not_labelled_page_wide():
+    # Threshold is 3 — 2 spans flipping is not enough to call page-wide.
+    old = [
+        _span("First short heading", bold=True),
+        _span("Second short heading", bold=True),
+    ]
+    new = [
+        _span("First short heading", bold=False),
+        _span("Second short heading", bold=False),
+    ]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 1
+    assert result['formatting_changes'][0]['page_wide'] is False
+
+
+def test_missing_bold_key_treated_as_false_no_phantom_flip():
+    # A span dict that omits 'bold' entirely should be treated as bold=False
+    # for comparison purposes — not as None, which would falsely flip vs False.
+    old = [{'text': "Body text from older ingest path", 'italic': False,
+            'font': 'Helvetica', 'size': 10.0, 'color': '#000000',
+            'bbox': (0, 10, 100, 22)}]
+    new = [{'text': "Body text from older ingest path", 'bold': False,
+            'italic': False, 'font': 'Helvetica', 'size': 10.0,
+            'color': '#000000', 'bbox': (0, 10, 100, 22)}]
+
+    result = compute_formatting_diff(old, new, 1, 1)
+
+    assert result['finding_count'] == 0
+
+
+def test_empty_old_spans_returns_no_findings():
+    result = compute_formatting_diff([], [_span("Some new text")], 1, 1)
+    assert result['finding_count'] == 0
+    assert result['severity'] == 'none'
+
+
+def test_empty_new_spans_returns_no_findings():
+    result = compute_formatting_diff([_span("Some old text")], [], 1, 1)
+    assert result['finding_count'] == 0
+    assert result['severity'] == 'none'
--- a/backend/tests/test_ingest_color.py
+++ b/backend/tests/test_ingest_color.py
@@ -0,0 +1,43 @@
+"""Verifies ingest._extract_page_spans surfaces span color as a '#rrggbb' string."""
+
+import fitz
+import pytest
+
+from document_mode.ingest import _extract_page_spans
+
+
+def _make_one_page_pdf_with_colored_text(tmp_path):
+    """Build a 1-page PDF with two spans: a black one and a red one."""
+    pdf_path = tmp_path / "colored.pdf"
+    doc = fitz.open()
+    page = doc.new_page()
+    page.insert_text((72, 72), "Black text", fontsize=12, color=(0, 0, 0))
+    page.insert_text((72, 100), "Red text", fontsize=12, color=(1, 0, 0))
+    doc.save(str(pdf_path))
+    doc.close()
+    return str(pdf_path)
+
+
+def test_extract_page_spans_includes_color_field(tmp_path):
+    pdf_path = _make_one_page_pdf_with_colored_text(tmp_path)
+    doc = fitz.open(pdf_path)
+    spans = _extract_page_spans(doc[0])
+    doc.close()
+
+    assert spans, "expected at least one span"
+    for span in spans:
+        assert 'color' in span, f"span missing 'color' key: {span}"
+        assert isinstance(span['color'], str), f"color should be string: {span['color']!r}"
+        assert span['color'].startswith('#'), f"color should be hex: {span['color']!r}"
+        assert len(span['color']) == 7, f"color should be #rrggbb: {span['color']!r}"
+
+
+def test_extract_page_spans_red_text_has_red_color(tmp_path):
+    pdf_path = _make_one_page_pdf_with_colored_text(tmp_path)
+    doc = fitz.open(pdf_path)
+    spans = _extract_page_spans(doc[0])
+    doc.close()
+
+    red_spans = [s for s in spans if 'Red' in s['text']]
+    assert red_spans, "expected to find the 'Red text' span"
+    assert red_spans[0]['color'] == '#ff0000', f"got {red_spans[0]['color']}"
--- a/docs/superpowers/plans/2026-05-19-axa-formatting-diff.md
+++ b/docs/superpowers/plans/2026-05-19-axa-formatting-diff.md
--- a/docs/superpowers/specs/2026-05-19-axa-formatting-diff-design.md
+++ b/docs/superpowers/specs/2026-05-19-axa-formatting-diff-design.md
@@ -0,0 +1,266 @@
+# AXA Diff — Deterministic Formatting Layer
+
+**Date:** 2026-05-19
+**Client:** AXA (Ireland)
+**Profile affected:** `axa_policy_document_diff`
+**Check affected:** `axa_pdf_diff`
+**Status:** Design
+
+---
+
+## Problem
+
+AXA reported that an updated car-insurance policy (`7274754 - AXA - Car Insurance Policy V1.pdf`) lost bold formatting on words that were bold in the previous version (`axa-car-insurance-policy-011024.pdf`). In their words: "from page 18 onwards the text in blue isn't bold in the updated document whereas it is in the old document." The `axa_pdf_diff` report generated for them on 2026-05-18 missed these formatting changes: pages 18, 19, and several others are marked "identical / No differences detected" even though the bold flips are real.
+
+Inspection of the report confirms the vision-LLM does catch *some* bold changes (page 22 of that report correctly flagged "unattended" gaining bold-blue formatting as high severity). The failure mode is inconsistent recall: Gemini 2.5 Pro on rendered page images is unreliable at distinguishing bold-vs-regular weight, especially on dense policy-document layouts where the stroke-weight difference is subtle.
+
+## Root cause
+
+`document_mode/ingest.py` already extracts per-span `bold`, `italic`, `font`, and `size` metadata via PyMuPDF — ground-truth typographic data straight from the PDF. The current diff pipeline (`document_mode/diff_engine.py`) renders pages to PNGs and asks Gemini to compare images, never consulting the structured span data it already has on disk. We are doing visual diff when we should be doing structural diff for formatting changes.
+
+## Goal
+
+Add a deterministic formatting-diff layer that runs alongside the existing vision-LLM diff for each aligned page-pair, compares span-level typographic attributes between old and new, and emits structured findings that surface in the existing diff report. The LLM diff stays in place for content/narrative changes (its strength); the deterministic layer owns formatting (the LLM's weakness).
+
+## Architecture
+
+```
+                  ┌─────────────────────────────┐
+                  │ run_page_pair_diff (existing)│
+                  └──────────────┬──────────────┘
+                                 │
+                     for each aligned pair:
+                                 │
+            ┌────────────────────┴────────────────────┐
+            ▼                                         ▼
+   _diff_one_pair (LLM)                  compute_formatting_diff (new)
+   - Renders both PNGs                   - Reads spans from ingest meta
+   - Gemini vision call                  - Matches by exact text
+   - Returns added/removed/modified...   - Compares 5-tuple of attrs
+                                         - Aggregates, suppresses noise
+                                         - Returns formatting_changes[]
+            │                                         │
+            └────────────────────┬────────────────────┘
+                                 ▼
+              merged into pair_diffs[key]['diff']
+                                 │
+                                 ▼
+                  diff_report_writer.py renders:
+                  - existing blocks (added/removed/modified/style)
+                  - new block: "🎨 Formatting changes" (block-style)
+```
+
+Eight components ship in this cycle:
+
+1. **`ingest.py` extension** — capture `color` in span dicts. PyMuPDF exposes `span['color']` as a 24-bit RGB int; convert to `#rrggbb` string for downstream readability.
+2. **New module `document_mode/formatting_diff.py`** — single public function `compute_formatting_diff(old_spans, new_spans, old_page_num, new_page_num) -> dict`.
+3. **`diff_engine.run_page_pair_diff` integration** — after the LLM call returns per pair, invoke `compute_formatting_diff` using span data from `old_pages_meta` / `new_pages_meta`, merge findings into `pair_diffs[key]['diff']['formatting_changes']`.
+4. **Severity contribution** — each aggregated formatting finding counts as one `medium` in `severity_counts`, contributing −3 to `_diff_score`.
+5. **`diff_engine.py` payload** — `old_pages_meta` / `new_pages_meta` are extended to include the full span list (currently only `page_num`, `fonts_used`, `image_path`). The formatting-diff layer needs spans; without them we'd re-ingest.
+6. **`diff_report_writer.py` rendering** — new block category "🎨 Formatting changes", uses existing `block-style` CSS class (purple border, already in the stylesheet).
+7. **At-a-glance counter update** — add an aggregated count of formatting-change findings to the existing glance grid (or roll them into the existing `medium` card — see Open question 1 below).
+8. **Test fixtures** — copy the two AXA PDFs into `backend/tests/fixtures/axa_diff/` (gitignored per existing pattern) and add a smoke-test script that produces a report we can diff against the baseline.
+
+## Module design
+
+### `document_mode/formatting_diff.py`
+
+Single public function:
+
+```python
+def compute_formatting_diff(
+    old_spans: list[dict],
+    new_spans: list[dict],
+    old_page_num: int,
+    new_page_num: int,
+) -> dict:
+    """Deterministic formatting comparison for one aligned page-pair.
+
+    Returns:
+        {
+            'formatting_changes': [
+                {
+                    'attribute': 'bold' | 'italic' | 'font' | 'size' | 'color',
+                    'old_value': str | bool | float,
+                    'new_value': str | bool | float,
+                    'example_quotes': [str, ...],     # up to MAX_QUOTES_PER_FINDING
+                    'total_span_count': int,           # total spans aggregated
+                    'page_wide': bool,                 # true if entire page flipped
+                },
+                ...
+            ],
+            'finding_count': int,
+            'severity': 'medium' | 'none',
+        }
+    """
+```
+
+### Matching algorithm
+
+For each `old_span` in `old_spans`:
+
+1. Skip if `len(old_span['text'].strip()) < 4` — single chars and short tokens (`"the"`, `"of"`, `"1"`, `"."`) are too ambiguous to match reliably.
+2. Find candidates in `new_spans` where the trimmed text matches exactly.
+3. If exactly one candidate: pair them.
+4. If multiple candidates: pick the one whose y-midpoint (`(bbox[1] + bbox[3]) / 2`) is closest to the old span's y-midpoint. This assumes an aligned page keeps roughly the same vertical layout after re-typesetting, so the nearest candidate is normally the same occurrence of the text.
+5. If zero candidates: ignore this span entirely (content rewrite — the LLM diff owns those).
+
+For each paired `(old_span, new_span)`, compare the 5-tuple `(bold, italic, font, size, color)`. Any attribute that differs becomes a *flip record*.
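The matching steps above can be sketched roughly as follows. This is a minimal illustration, not the shipped module: `match_spans`, `y_mid`, and `MIN_TEXT_LEN` are hypothetical names, and spans are assumed to be dicts with `text` and `bbox` keys as produced by ingest.

```python
MIN_TEXT_LEN = 4  # assumed constant name; spans shorter than this are skipped (step 1)


def y_mid(span: dict) -> float:
    """Vertical midpoint of a span's bbox, given as (x0, y0, x1, y1)."""
    return (span["bbox"][1] + span["bbox"][3]) / 2


def match_spans(old_spans: list[dict], new_spans: list[dict]) -> list[tuple[dict, dict]]:
    """Pair old spans with new spans by exact trimmed text (steps 1-5)."""
    # Index new spans by trimmed text so candidate lookup is a dict hit.
    by_text: dict[str, list[dict]] = {}
    for span in new_spans:
        by_text.setdefault(span["text"].strip(), []).append(span)

    pairs = []
    for old in old_spans:
        text = old["text"].strip()
        if len(text) < MIN_TEXT_LEN:
            continue  # step 1: too short to match reliably
        candidates = by_text.get(text, [])
        if not candidates:
            continue  # step 5: content rewrite, left to the LLM diff
        # Steps 3-4: a single candidate wins trivially; otherwise nearest y-midpoint.
        best = min(candidates, key=lambda c: abs(y_mid(c) - y_mid(old)))
        pairs.append((old, best))
    return pairs


old = [
    {"text": "unattended", "bbox": (50, 100, 120, 112), "bold": True},
    {"text": "unattended", "bbox": (50, 400, 120, 412), "bold": True},
    {"text": "of", "bbox": (10, 100, 20, 112), "bold": True},
]
new = [
    {"text": "unattended", "bbox": (50, 102, 120, 114), "bold": False},
    {"text": "unattended", "bbox": (50, 398, 120, 410), "bold": True},
]
pairs = match_spans(old, new)
```

Note that this sketch does not consume a new span once it is paired, so two old spans could in principle match the same new span. The spec does not say whether that should be prevented.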
+
+### Aggregation
+
+Flip records are aggregated into findings keyed by `(attribute, old_value, new_value)`. So 12 spans on a page that all flipped from `bold=True → bold=False` become **one** finding with `total_span_count=12` and up to `MAX_QUOTES_PER_FINDING=3` example quotes. This keeps the report readable for large flips and bounds severity-score impact (one medium per finding, not twelve).
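As a rough sketch of the grouping described above, assuming each flip record is a dict with `attribute`, `old_value`, `new_value`, and `text` keys (the record shape and the name `aggregate_flips` are illustrative assumptions, not taken from the module):

```python
from collections import defaultdict

MAX_QUOTES_PER_FINDING = 3


def aggregate_flips(flips: list[dict]) -> list[dict]:
    """Collapse flip records into one finding per (attribute, old_value, new_value)."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for flip in flips:
        key = (flip["attribute"], flip["old_value"], flip["new_value"])
        groups[key].append(flip["text"])

    return [
        {
            "attribute": attribute,
            "old_value": old_value,
            "new_value": new_value,
            "example_quotes": texts[:MAX_QUOTES_PER_FINDING],  # cap the quotes shown
            "total_span_count": len(texts),                    # but count every span
        }
        for (attribute, old_value, new_value), texts in groups.items()
    ]


# Twelve spans that all flipped bold=True -> bold=False on one page.
flips = [
    {"attribute": "bold", "old_value": True, "new_value": False, "text": f"clause {i}"}
    for i in range(12)
]
findings = aggregate_flips(flips)
```

Here `findings` holds a single entry with `total_span_count` of 12 and three example quotes, which is what keeps the score impact at one medium rather than twelve.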
+
+### Page-wide suppression
+
+If `total_span_count` for a finding equals the count of *matched* spans on the page (i.e. every comparable span on this page has the same flip), mark `page_wide=True` and render the finding with a "Page-wide change: …" prefix. This flags the case where a doc-wide style change (e.g. body font renamed in a re-export) flips every matched span on the page, so the report presents it as one page-level style shift rather than as a set of local span flips.
+
+A future enhancement could elevate page-wide changes that span multiple pages into a document-level summary, but for this cycle each page emits its own finding independently.
+
+### Noise suppression — size
+
+Span `size` is rounded to 2 dp in `ingest.py` (`'size': round(span.get('size', 0), 2)`). Use a tolerance of ±0.05pt when comparing sizes: shifts that small are typically rounding artefacts of re-export, not intentional design changes. Bold/italic/font/color use exact equality.
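A possible shape for the size comparison, shown only as a sketch. The names `SIZE_TOLERANCE_PT` and `size_changed` are hypothetical, and treating a difference of exactly 0.05pt as within tolerance is an assumption the spec does not settle.

```python
SIZE_TOLERANCE_PT = 0.05  # assumed constant name


def size_changed(old_size: float, new_size: float) -> bool:
    """True when two span sizes differ by more than the tolerance."""
    # Round the difference to 2 dp (sizes are already 2 dp in ingest) so that
    # binary float noise cannot push an at-tolerance difference over the limit.
    return round(abs(new_size - old_size), 2) > SIZE_TOLERANCE_PT


within = size_changed(10.00, 10.04)   # inside tolerance
outside = size_changed(10.00, 10.10)  # outside tolerance
```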
+
+### Color extraction
+
+In `ingest.py:_extract_page_spans`, add to the span dict:
+
+```python
+color_int = span.get('color', 0) or 0
+spans.append({
+    ...,
+    'color': f'#{color_int:06x}',
+})
+```
+
+PyMuPDF returns colour as a packed integer when the span uses a single device colour. Edge cases (gradient text, no-fill text) return 0 — represented as `#000000`. That's acceptable for diff purposes since we only care whether the value flipped, not whether it's semantically meaningful in isolation.
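To make the conversion concrete, here is a small standalone check of the same format string. `color_to_hex` is a hypothetical helper name used for illustration only.

```python
def color_to_hex(color_int) -> str:
    """Format a packed 0xRRGGBB integer as a lowercase '#rrggbb' string."""
    # 'or 0' maps a missing or None colour to black, as in the snippet above.
    return f"#{(color_int or 0):06x}"


brand_blue = color_to_hex(0x2E3092)
pure_blue = color_to_hex(255)   # small values are zero-padded to six digits
missing = color_to_hex(None)
```

The `06x` width matters: without the zero padding, a value such as 255 would render as `#ff` and compare unequal to a correctly padded `#0000ff`.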
+
+## Pipeline integration
+
+### `diff_engine.run_page_pair_diff`
+
+Today, this function:
+1. Aligns pages between old and new ingest.
+2. For each matched pair, dispatches `_diff_one_pair` (LLM call) into a thread pool.
+3. Accumulates `pair_diffs` keyed by `f"{old_page}->{new_page}"`.
+
+After this change:
+1. Alignment unchanged.
+2. After the LLM call returns for a pair, the same thread calls `compute_formatting_diff` (CPU-only, ~1–5 ms per page) and merges its `formatting_changes` array into `pair_diffs[key]['diff']`.
+3. `severity_counts['medium']` is incremented by `finding_count` from the deterministic layer.
+4. The `differences_found` flag on each pair becomes `True` if the LLM saw differences *or* the formatting diff returned ≥1 finding. This stops the report from rendering a page as "identical" when only the deterministic layer caught the change.
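Steps 2 to 4 could look roughly like the helper below. This is a sketch under assumptions: `merge_formatting` is a hypothetical name, and placing `differences_found` inside the pair's `diff` dict is a guess at the payload shape, which the spec does not pin down.

```python
def merge_formatting(pair_entry: dict, fmt_result: dict, severity_counts: dict) -> None:
    """Fold one pair's deterministic findings into the existing LLM result."""
    changes = fmt_result.get("formatting_changes", [])
    pair_entry["diff"]["formatting_changes"] = changes

    # Each aggregated finding counts as one medium.
    severity_counts["medium"] = (
        severity_counts.get("medium", 0) + fmt_result.get("finding_count", 0)
    )

    # A page the LLM judged identical is re-classified when the
    # deterministic layer found at least one flip.
    if changes:
        pair_entry["diff"]["differences_found"] = True


severity_counts = {"high": 0, "medium": 1, "low": 0}
pair = {"diff": {"differences_found": False}}
fmt = {
    "formatting_changes": [{"attribute": "bold", "old_value": True, "new_value": False}],
    "finding_count": 1,
    "severity": "medium",
}
merge_formatting(pair, fmt, severity_counts)
```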
+
+### Span data plumbing
+
+`run_document_diff_analysis` currently returns `old_pages_meta` / `new_pages_meta` with `page_num`, `fonts_used`, `image_path` only. Extend both to include `spans` (list of span dicts). Memory cost: ~10–50 KB per page for a typical policy doc; acceptable in-process and not written to disk.
+
+Alternative considered and rejected: re-ingest both PDFs from within `formatting_diff.py`. Cleaner separation but doubles I/O and PyMuPDF parse cost. Pass-through is simpler.
+
+## Score impact
+
+`_diff_score` already deducts 3 points per medium-severity entry in `severity_counts['medium']`. With the deterministic layer feeding into the same bucket, no formula change is required. A doc with 8 formatting findings would lose 24 points from those alone; combined with whatever the LLM found, the score reflects total change volume across both layers.
+
+If a future round wants to weight deterministic findings separately (e.g. less than LLM-flagged changes since they're often cosmetic), the cleanest place is to add a `formatting_change_weight` constant to `_diff_score`. Out of scope for this cycle.
+
+## Rendering
+
+### Report block
+
+In `diff_report_writer.py`, after the existing `style_changes` rendering block, add:
+
+```python
+if formatting_changes:
+    blocks.append(_render_formatting_block(formatting_changes))
+```
+
+Block markup:
+
+```html
+<div class='diff-block block-style'>
+  <div class='diff-label'>🎨 Formatting changes</div>
+  <ul>
+    <li>
+      <strong>Bold → Regular</strong> (12 spans across page):
+      "Theft of personal belongings if your car is left unattended…",
+      "Theft of push chairs, prams, buggies…",
+      "Damage caused to your car while…"
+      <span class='muted'>…and 9 more</span>
+    </li>
+    <li>
+      <strong>Page-wide font change</strong>: AXASans-Bold → AXASans-Regular
+    </li>
+  </ul>
+</div>
+```
+
+`block-style` CSS class already exists (purple left border). HTML-escape all example_quotes via `html.escape()`. Mirror the per-span HTML escape pattern already used elsewhere in the writer.
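One way the renderer might produce that markup is sketched below. `render_formatting_block_sketch` is a hypothetical stand-in for `_render_formatting_block`, and the label wording is simplified relative to the example markup above.

```python
import html


def render_formatting_block_sketch(formatting_changes: list[dict]) -> str:
    """Render aggregated findings as one block-style HTML fragment."""
    items = []
    for finding in formatting_changes:
        # Quotes come from the PDF text, so every one is escaped before output.
        quotes = ", ".join(f'"{html.escape(q)}"' for q in finding["example_quotes"])
        label = html.escape(
            f"{finding['attribute']}: {finding['old_value']} → {finding['new_value']}"
        )
        remaining = finding["total_span_count"] - len(finding["example_quotes"])
        more = f" <span class='muted'>…and {remaining} more</span>" if remaining > 0 else ""
        items.append(
            f"<li><strong>{label}</strong> "
            f"({finding['total_span_count']} spans): {quotes}{more}</li>"
        )
    return (
        "<div class='diff-block block-style'>"
        "<div class='diff-label'>🎨 Formatting changes</div>"
        f"<ul>{''.join(items)}</ul></div>"
    )


finding = {
    "attribute": "bold",
    "old_value": True,
    "new_value": False,
    "example_quotes": [
        "Theft of <b>personal</b> belongings",
        "Theft of push chairs",
        "Damage caused to your car",
    ],
    "total_span_count": 12,
}
html_out = render_formatting_block_sketch([finding])
```

The first quote deliberately contains markup to show that it reaches the report as escaped text rather than as a live tag.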
+
+### Glance grid
+
+Open question 1 (see below) — Nick to decide whether formatting findings get their own glance card or fold into the existing medium counter. Default for spec-writing is **fold into medium**, since each finding already counts as one medium severity.
+
+## Testing approach
+
+### Fixture-based smoke test
+
+1. Create `backend/tests/fixtures/axa_diff/` (already gitignored per `.gitignore` line `backend/tests/fixtures/`).
+2. Copy the two PDFs from `/Users/nickviljoen/Desktop/AI_QC_Bitbucket/axa_ireland/18_may_test/` into the fixture dir.
+3. Add `backend/tests/test_formatting_diff.py` with:
+   - `test_compute_formatting_diff_returns_empty_for_identical_pages` (load same page from old PDF twice → finding_count == 0).
+   - `test_compute_formatting_diff_detects_bold_flip` (synthetic span data: one span with bold=True in old, bold=False in new → returns one finding with attribute='bold').
+   - `test_aggregates_identical_flips` (3 spans all `bold=True → False` with different text → one finding, total_span_count=3, 3 example_quotes).
+   - `test_page_wide_flag_set_when_all_matched_spans_flip` (synthetic data: all spans flip → `page_wide=True`).
+   - `test_short_text_spans_are_ignored` (span with text="of" flipping bold → no finding).
+   - `test_size_tolerance_within_005pt_not_flagged` (size 10.00 → 10.04: no finding; 10.00 → 10.10: finding).
+
+The fixture-based end-to-end smoke (running the full diff on the two real AXA PDFs) is verified manually by Nick locally before push to dev.
+
+### Manual verification on dev
+
+After push to dev, Nick re-runs the diff with the same two PDFs through the UI. Acceptance criteria:
+- The diff report shows a "🎨 Formatting changes" block on at least pages 18 and 19 (the pages the client flagged).
+- Page 22 (which the LLM already caught) gains a deterministic finding alongside the existing LLM finding — duplication is acceptable.
+- The overall score drops from 12 to a lower number reflecting the additional medium findings.
+- No regressions on pages the LLM correctly flagged with content changes.
+
+## Deployment plan
+
+This work is one of multiple AXA-area changes Nick is shipping today. The cadence:
+
+1. **Local execution and test.** Subagent-driven implementation in the current working directory. Nick tests locally with `./scripts/run-local.sh` and the two PDF fixtures. Iterate until the report shows the expected formatting findings on page 18+.
+2. **Push to develop.** Single commit (or bundled commit with the other AXA changes Nick is making today, if convenient). Nick re-tests on dev via `https://optical-dev.oliver.solutions/ai_qc/`.
+3. **Hold for prod bundle.** No prod deploy from this cycle alone. Nick bundles this with the other AXA changes made today and pushes to prod tonight via the standard `develop → main` PR + tag + `deploy.sh prod <tag>` flow when the prod server can be restarted.
+
+The cycle does **not** end with a prod tag. It ends with a green dev run that Nick approves.
+
+## Out of scope (deferred follow-ups)
+
+1. **Defined-term escalation.** When a bold flip's text matches the AXA bold-words seed list (`backend/document_mode/data/axa_bold_words_seed.json`), escalate to high severity. Easy follow-up; left out to keep this cycle focused.
+2. **Cross-page document-level summary.** If pages 18, 19, 20, 21 all show the same `bold → regular` flip, a top-of-report "Document-wide formatting drift" callout would be more useful than four per-page findings. Defer until we see the real signal pattern.
+3. **Span match by character n-gram for reworded text.** Out of scope; content rewrites are the LLM's job.
+4. **Italic / size / font / color tuning based on first AXA report.** If the all-5-attributes scope produces too much noise on the first real run, we'll tune thresholds in a follow-up. The aggregation + page-wide suppression should keep volume manageable, but real data is the only honest test.
+5. **HTML report download-link change.** Existing report download flow unchanged.
+
+## Open questions
+
+1. **Glance grid card.** Default decision: fold formatting findings into the existing `medium` card (no new card). Alternative: dedicated "Formatting changes" card next to the severity cards. Nick to confirm or override during plan-writing.
+
+That's the only one left. The three brainstorm questions (approach, scope, severity) are locked.
+
+## Files touched
+
+- **Modify:** `backend/document_mode/ingest.py` — extend `_extract_page_spans` to capture color.
+- **Create:** `backend/document_mode/formatting_diff.py` — new module, ~180 lines.
+- **Modify:** `backend/document_mode/diff_engine.py` — extend `old_pages_meta`/`new_pages_meta`; invoke `compute_formatting_diff` per pair in `run_page_pair_diff`; bump `severity_counts['medium']`.
+- **Modify:** `backend/document_mode/diff_report_writer.py` — add formatting-changes rendering block.
+- **Create:** `backend/tests/test_formatting_diff.py` — unit tests for the new module.
+- **Update:** `CLAUDE_AXA.md` — note the deterministic formatting layer in `axa_pdf_diff` row of the AI usage table (now partially deterministic + AI, not pure AI).
+- **Update:** `memory/project_state.md` — note the v1.5.0 (or whatever tag this lands under) shipment when bundle is deployed to prod tonight.
+
+## Acceptance
+
+The cycle is complete when:
+1. Unit tests in `backend/tests/test_formatting_diff.py` pass locally.
+2. Smoke test on dev: the two AXA PDFs produce a diff report with formatting-change findings on the pages the client flagged.
+3. Nick confirms the deterministic findings are accurate (no false positives on the test data).
+4. Code is on `develop` ready to be bundled into the prod-tonight deploy.
Author	SHA1	Message	Date
nickviljoen	29ee941037	refactor(formatting_diff): narrow scope to bold + italic only First real-data test against the AXA car-insurance PDFs surfaced a noise problem: the new document is a brand refresh — every page flips font (PublicoBanner-Bold→PublicoHeadline-Bold) and colour (#893f4a→#2e3092). At medium-per-finding that crashed the diff score to 0.0 and drowned the bold-regression signal AXA actually flagged. Drop font, size, colour comparators. Keep bold + italic — the attributes the vision-LLM consistently misses on dense layouts. The LLM already narrates colour-scheme rebrands and font swaps in its Modified / Style-changes blocks; running both layers on the same visual change just double-counts it. Tests inverted from "X change is flagged" to "X change is NOT flagged" to lock the scope decision in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 12:37:19 +02:00
nickviljoen	d327776c70	fix(diff_engine): guard compute_formatting_diff against per-pair failure If the deterministic formatting comparator raises on any single page-pair (e.g. unexpected span shape from a future PyMuPDF version), degrade to zero formatting findings for that pair instead of aborting the whole 52-page diff run. Logged for visibility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:31:16 +02:00
nickviljoen	640bbe4671	docs(axa): note deterministic formatting layer added to axa_pdf_diff Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:23:15 +02:00
nickviljoen	0fd6a35562	fix(diff_report): _fmt_value labels italic flips correctly Previously every boolean attribute rendered as "Bold → Regular", producing "Italic: Bold → Regular" for italic flips. Now the helper takes the attribute name and emits "Italic → Regular" or "Bold → Regular" depending on which boolean attribute is being shown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:22:39 +02:00
nickviljoen	7eaac85df3	feat(diff_report): render formatting_changes as a per-pair block Adds a "🎨 Formatting changes" block to the per-page diff report when the deterministic formatting layer finds typographic flips. Distinguishes page-wide style shifts from local span flips, lists up to three example quotes per aggregated finding, and HTML-escapes all user-controlled strings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:08:47 +02:00
nickviljoen	2b1bb9ccf0	feat(diff_engine): merge formatting_diff findings into pair_diffs run_page_pair_diff now invokes compute_formatting_diff alongside the LLM call for each aligned pair. When the deterministic layer finds typographic flips on a page the LLM saw as identical, the pair is re-classified as having differences with medium severity. Each aggregated finding contributes to the global medium-severity tally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:03:54 +02:00
nickviljoen	d21a8a276d	refactor(formatting_diff): harden page_wide threshold + None-key handling Three review-driven hardening tweaks: - page_wide now requires ≥3 matched spans (PAGE_WIDE_MIN_SPANS). Avoids labelling section-break pages with a single flipped heading as page-wide. - _collect_flips normalises bold/italic via bool() and font/color via "or ''" so callers passing dicts without those keys do not produce phantom flips against False/''. - Adds tests for empty span lists and the missing-bold-key case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:01:23 +02:00
nickviljoen	98679e7329	feat(document_mode): add deterministic span formatting diff New formatting_diff module compares span-level bold/italic/font/size/color attributes between aligned page-pairs. Pure-Python; reads PyMuPDF metadata already captured during ingest. Aggregates identical flips into single findings and flags page-wide style shifts. Powers the AXA document_diff fix for missed formatting changes that the vision-LLM does not reliably detect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:56:34 +02:00
nickviljoen	f69e181520	feat(ingest): capture span color as #rrggbb string Adds a 'color' field to each span dict extracted by _extract_page_spans. Powers the upcoming deterministic formatting-diff layer for AXA document_diff mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:45:21 +02:00
nickviljoen	25bb472a53	docs: plan for AXA diff deterministic formatting layer Six-task TDD plan implementing the spec at docs/superpowers/specs/2026-05-19-axa-formatting-diff-design.md. Local execution only this cycle — Nick tests locally before any push to dev, and the prod bundle ships tonight alongside other AXA changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:39:40 +02:00
nickviljoen	5e1263380e	docs: spec for AXA diff deterministic formatting layer New formatting_diff module compares PyMuPDF span-level (bold, italic, font, size, color) attributes between aligned page-pairs to catch formatting changes the vision-LLM misses. Addresses AXA client feedback that page 18+ un-bolding of blue text was not surfaced in the diff report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:32:00 +02:00