Implements word-level speech analysis using faster-whisper to refine AD pause points. Gemini's timestamps are snapped to natural speech gaps (sentence/phrase boundaries) to prevent pauses mid-word.

Key changes:
- Add WhisperService for transcription and gap detection
- Add dedicated Celery task routed to 'whisper' queue
- Integrate refinement into render_accessible_video task
- Cache Whisper transcripts in MongoDB for reuse across languages
- Add dedicated whisper-worker with concurrency=1 to prevent OOM

Configuration:
- Uses faster-whisper 'base' model (multilingual, ~145MB)
- 5-second search window after Gemini's recommended point
- Falls back to original timestamp if no gap found

Infrastructure:
- New Docker stage: whisper-worker
- New Cloud Run service: accessible-video-whisper-worker
- Updated docker-compose.yml with whisper-worker service

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
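The snapping rule described in the message (search up to 5 seconds after the recommended point, otherwise keep the original timestamp) could look roughly like the sketch below. This is not the WhisperService code from this commit, which is not shown here. The names `Word`, `find_gaps`, and `snap_to_gap` are hypothetical, and the 0.3-second minimum gap is an assumed threshold. Words are assumed to be sorted by start time.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical stand-in for one word-level timestamp."""
    text: str
    start: float  # seconds
    end: float    # seconds


def find_gaps(words: list[Word], min_gap: float = 0.3) -> list[tuple[float, float]]:
    """Return (gap_start, gap_end) for each silence of at least min_gap seconds.

    min_gap is an assumed threshold, not a value taken from the commit.
    """
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        if nxt.start - prev.end >= min_gap:
            gaps.append((prev.end, nxt.start))
    return gaps


def snap_to_gap(
    timestamp: float,
    words: list[Word],
    window: float = 5.0,
    min_gap: float = 0.3,
) -> float:
    """Snap timestamp to the first speech gap within [timestamp, timestamp + window].

    Returns the original timestamp when no gap is found in the window.
    """
    for gap_start, gap_end in find_gaps(words, min_gap):
        if gap_end <= timestamp:
            continue  # gap lies entirely before the recommended point
        if gap_start > timestamp + window:
            break  # later gaps are outside the search window
        # If the timestamp already sits inside this gap, keep it as-is.
        return max(gap_start, timestamp)
    return timestamp


words = [
    Word("hello", 0.0, 0.5),
    Word("world", 0.55, 1.0),
    Word("next", 1.6, 2.0),
]
snapped = snap_to_gap(0.7, words)    # mid-word point moves to the gap at 1.0
unchanged = snap_to_gap(3.0, words)  # no gap after 3.0, so fallback applies
```

Returning the start of the gap places the pause immediately after the last spoken word. Picking the gap midpoint would be an equally reasonable choice under the same assumptions.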
21 lines
704 B
Python
"""Schemas for Whisper transcript caching."""

from pydantic import BaseModel, Field


class CachedWordTimestamp(BaseModel):
    """Word timestamp for MongoDB storage."""
    word: str
    start: float
    end: float


class CachedWhisperTranscript(BaseModel):
    """Cached Whisper transcript stored in job document."""
    words: list[CachedWordTimestamp] = Field(
        default_factory=list,
        description="Word-level timestamps from Whisper"
    )
    model_name: str = Field(..., description="Whisper model used")
    audio_duration: float = Field(..., description="Source audio duration in seconds")
    created_at: str = Field(..., description="ISO timestamp when transcript was created")
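As a rough illustration of how these schemas might be used for caching, the sketch below builds a transcript, converts it to a plain dict suitable for embedding in a job document, and validates it back. It assumes Pydantic v2 (`model_dump`, `model_validate`, `ConfigDict`). The models are repeated in trimmed form so the block runs on its own, the sample values are made up, and no MongoDB call is shown because the storage code is not part of this file.

```python
from datetime import datetime, timezone

from pydantic import BaseModel, ConfigDict, Field


class CachedWordTimestamp(BaseModel):
    word: str
    start: float
    end: float


class CachedWhisperTranscript(BaseModel):
    # Assumption: older Pydantic v2 releases warn about fields that start
    # with "model_"; clearing protected_namespaces silences that warning.
    model_config = ConfigDict(protected_namespaces=())

    words: list[CachedWordTimestamp] = Field(default_factory=list)
    model_name: str
    audio_duration: float
    created_at: str


# Illustrative values only.
transcript = CachedWhisperTranscript(
    words=[
        CachedWordTimestamp(word="Hello", start=0.0, end=0.42),
        CachedWordTimestamp(word="world", start=0.5, end=0.9),
    ],
    model_name="base",
    audio_duration=1.2,
    created_at=datetime.now(timezone.utc).isoformat(),
)

# Plain dict of built-in types, ready to embed in a document store.
doc = transcript.model_dump()

# Reading the cached value back re-validates every field.
restored = CachedWhisperTranscript.model_validate(doc)
```

Because `created_at` is typed as `str`, the ISO string is stored as-is and is not parsed back into a `datetime` on load.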