amazon-transcreation/backend/app
DJP 70cade819c Source-line batching with prompt caching for arbitrarily large briefs
Previously briefs above ~150 source lines hit the Sonnet 4.6 64k output
cap and were silently truncated. Now we batch:

- ≤70 lines:  one LLM call (no change)
- 71-150:     batches of 50 (2-3 calls)
- 151+:       batches of 40 (unbounded)

Each batch uses Anthropic prompt caching: the V25 system prompt + job
parameters + TM entries + reference data + supplementary files form a
cached prefix; only the per-batch source lines vary. After the first
batch, subsequent batches read the prefix from cache at ~10% input cost,
so an N-batch job costs roughly (1 + 0.1*(N-1)) full prompts instead
of N.

Implementation:
- New LLMClient.create_message_cached / acreate_message_cached methods
  that mark system_prompt and cached_user_content with cache_control:
  ephemeral. Tracks cache_creation_input_tokens and
  cache_read_input_tokens in usage and applies the right cost rates
  (1.25x for writes, 0.1x for reads).
- AgentSingle.run() refactored to build the cached static prefix once,
  then loop over batches sending only the per-batch source lines as the
  dynamic content. Each batch's parsed rows are appended to
  context.draft_outputs / ranking_declarations.
- Per-batch instructions added to the prompt for batched runs ("This is
  batch N of M ... output a table for these lines only ... do not
  repeat prior batches"). Single-call runs (≤70 lines) skip this note.
- Linguistic summary: kept from the last batch (batched mode) or the
  single batch (single mode).
- Per-batch logging of input_tokens / cache_read / cache_creation /
  output_tokens / stop_reason for visibility.

Verified end-to-end: N=10/70/100/150/250 produce 1/1/2/3/7 LLM calls
with correct draft counts, and live caching reads the cached prefix on
the second call within the 5-minute TTL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:02:48 -04:00
..
api Round 2 feedback: parser fix, dynamic max_tokens, polling, TM auto-discovery, reviewer comments in export 2026-05-04 16:12:47 -04:00
auth Implement user management: viewer role, real API wiring, admin sidebar 2026-04-15 18:37:16 +01:00
llm Source-line batching with prompt caching for arbitrarily large briefs 2026-05-05 15:02:48 -04:00
models Implement user management: viewer role, real API wiring, admin sidebar 2026-04-15 18:37:16 +01:00
pipeline Source-line batching with prompt caching for arbitrarily large briefs 2026-05-05 15:02:48 -04:00
schemas Implement user management: viewer role, real API wiring, admin sidebar 2026-04-15 18:37:16 +01:00
services Round 2.5 feedback: TM replacements take effect, supplementary files reach LLM, larger briefs fit, free-text channel uploads 2026-05-05 14:28:20 -04:00
tasks Round 2.5 feedback: TM replacements take effect, supplementary files reach LLM, larger briefs fit, free-text channel uploads 2026-05-05 14:28:20 -04:00
ws feat: complete Phase 1-2 scaffold — backend, frontend, pipeline skeleton 2026-04-10 12:31:43 -04:00
__init__.py feat: complete Phase 1-2 scaffold — backend, frontend, pipeline skeleton 2026-04-10 12:31:43 -04:00
config.py Add Azure AD MSAL SSO (SPA token exchange) 2026-04-15 18:08:46 +01:00
dependencies.py feat: complete Phase 1-2 scaffold — backend, frontend, pipeline skeleton 2026-04-10 12:31:43 -04:00
main.py feat: complete Phase 1-2 scaffold — backend, frontend, pipeline skeleton 2026-04-10 12:31:43 -04:00