Previously briefs above ~150 source lines hit the Sonnet 4.6 64k output
cap and were silently truncated. Now we batch:
- ≤70 lines: one LLM call (no change)
- 71-150: batches of 50 (2-3 calls)
- 151+: batches of 40 (unbounded)
Each batch uses Anthropic prompt caching: the V25 system prompt + job
parameters + TM entries + reference data + supplementary files form a
cached prefix; only the per-batch source lines vary. After the first
batch, subsequent batches read the prefix from cache at ~10% input cost,
so an N-batch job costs roughly (1 + 0.1*(N-1)) full prompts instead
of N.
Implementation:
- New LLMClient.create_message_cached / acreate_message_cached methods
that mark system_prompt and cached_user_content with cache_control:
ephemeral. Tracks cache_creation_input_tokens and
cache_read_input_tokens in usage and applies the right cost rates
(1.25x for writes, 0.1x for reads).
- AgentSingle.run() refactored to build the cached static prefix once,
then loop over batches sending only the per-batch source lines as the
dynamic content. Each batch's parsed rows are appended to
context.draft_outputs / ranking_declarations.
- Per-batch instructions added to the prompt for batched runs ("This is
batch N of M ... output a table for these lines only ... do not
repeat prior batches"). Single-call runs (≤70 lines) skip this note.
- Linguistic summary: kept from the last batch (batched mode) or the
single batch (single mode).
- Per-batch logging of input_tokens / cache_read / cache_creation /
output_tokens / stop_reason for visibility.
Verified end-to-end: N=10/70/100/150/250 produce 1/1/2/3/7 LLM calls
with correct draft counts, and live caching reads the cached prefix on
the second call within the 5-minute TTL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>