video-accessibility/backend/app/schemas/whisper.py
michael 05bde8326d feat: add Whisper-based pause point refinement for audio descriptions
Implements word-level speech analysis using faster-whisper to refine
AD pause points. Gemini's timestamps are snapped to natural speech gaps
(sentence/phrase boundaries) to prevent pauses mid-word.

Key changes:
- Add WhisperService for transcription and gap detection
- Add dedicated Celery task routed to 'whisper' queue
- Integrate refinement into render_accessible_video task
- Cache Whisper transcripts in MongoDB for reuse across languages
- Add dedicated whisper-worker with concurrency=1 to prevent OOM

Configuration:
- Uses faster-whisper 'base' model (multilingual, ~145MB)
- 5-second search window after Gemini's recommended point
- Falls back to original timestamp if no gap found
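
The snapping rule described above (search forward up to 5 seconds, otherwise keep the original timestamp) might look roughly like the sketch below. This is not the WhisperService code from this commit: the `Word` dataclass, the `snap_to_gap` function, and the 0.3-second `min_gap` threshold are all hypothetical names and values chosen for illustration. Only the 5-second window and the fallback behaviour come from the commit message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word-level timestamp, in seconds."""
    word: str
    start: float
    end: float


def snap_to_gap(
    words: List[Word],
    suggested: float,
    window: float = 5.0,   # search window from the commit message
    min_gap: float = 0.3,  # assumed threshold for a "natural" gap
) -> float:
    """Return a pause point inside the first speech gap at or after `suggested`.

    `words` must be sorted by start time. If no gap of at least `min_gap`
    seconds begins within `window` seconds of `suggested`, the original
    timestamp is returned unchanged.
    """
    for prev, nxt in zip(words, words[1:]):
        gap_start, gap_end = prev.end, nxt.start
        if gap_end - gap_start < min_gap:
            continue  # too short to count as a sentence/phrase boundary
        if gap_end <= suggested:
            continue  # gap lies entirely before the suggested point
        # If the suggestion already falls inside the gap, keep it as-is.
        candidate = max(gap_start, suggested)
        if candidate - suggested > window:
            break  # every later gap is even further away
        return candidate
    return suggested  # fallback: no usable gap found


words = [
    Word("Hello", 0.0, 0.4),
    Word("there", 0.45, 0.9),
    Word("friend.", 0.95, 1.5),
    Word("Next", 2.2, 2.6),
]
print(snap_to_gap(words, 0.5))              # moves to the gap after "friend."
print(snap_to_gap(words, 0.5, window=0.5))  # gap is out of range, unchanged
```

Returning the start of the gap rather than its midpoint is one possible design choice; the real implementation may pick a different point inside the gap.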

Infrastructure:
- New Docker stage: whisper-worker
- New Cloud Run service: accessible-video-whisper-worker
- Updated docker-compose.yml with whisper-worker service

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-27 08:27:48 -06:00

"""Schemas for Whisper transcript caching."""

from pydantic import BaseModel, Field


class CachedWordTimestamp(BaseModel):
    """Word timestamp for MongoDB storage."""

    word: str
    start: float
    end: float


class CachedWhisperTranscript(BaseModel):
    """Cached Whisper transcript stored in job document."""

    words: list[CachedWordTimestamp] = Field(
        default_factory=list,
        description="Word-level timestamps from Whisper"
    )
    model_name: str = Field(..., description="Whisper model used")
    audio_duration: float = Field(..., description="Source audio duration in seconds")
    created_at: str = Field(..., description="ISO timestamp when transcript was created")