video-accessibility/backend/app/services/embedding_service.py
Vadym Samoilenko fa351e4d25 feat: per-client glossary — hybrid exact/vector retrieval + AI injection
Adds full glossary system so Gemini uses client-approved terminology
when generating subtitles and translations (critical for 3M brand names
and product codes across 16 target locales).

Backend:
- lib/locales.py: BCP-47 locale registry, normalises xlsx fr_fr → fr-FR
- models/glossary.py: Glossary / GlossaryVersion / GlossaryTerm + enums
- services/glossary_service.py: xlsx parse (openpyxl), ingest to Mongo,
  hybrid retrieval (Aho-Corasick exact + Atlas Vector Search), prompt block
- services/embedding_service.py: Gemini text-embedding-004, batch 100, retry
- tasks/embed_glossary.py: Celery background task for async embedding
- api/v1/routes_glossaries.py: CRUD endpoints under /clients/{id}/glossaries
- gemini.py: _build_glossary_block(), {GLOSSARY} injection in all 4 call sites
- tts.py / gemini_tts.py: pass full locale codes (no split("-")[0] truncation)
- tasks/translate_and_synthesize.py: glossary lookup + injection per language
- prompts: {GLOSSARY} placeholder in ingestion, targeted, transcreation prompts
- pyproject.toml: +openpyxl, +pyahocorasick

Frontend:
- routes/admin/glossaries/: GlossaryList, GlossaryUpload, GlossaryDetail
- App.tsx: 3 new routes under /admin/clients/:clientId/glossaries
- ClientDetail.tsx: Glossaries card with count + quick links
- types/api.ts: Glossary, GlossaryVersion, GlossaryDetail, GlossaryTerm types
- lib/api.ts: 7 new API methods (upload, list, detail, terms, versions, activate, archive)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-29 13:03:38 +01:00


"""
Embedding service backed by Gemini text-embedding-004.
Provides batch embedding with retry/backoff for use in glossary ingestion.
Batch size: 100 texts per API call (API limit is 2048 but we keep it conservative
for memory and retry ergonomics with large glossaries).
"""
from __future__ import annotations

import asyncio
from collections.abc import Sequence

from google import genai
from google.genai import types as genai_types

from ..core.config import settings
from ..core.logging import get_logger

logger = get_logger(__name__)

_EMBED_MODEL = "text-embedding-004"
_BATCH_SIZE = 100
_MAX_RETRIES = 3
_INITIAL_BACKOFF = 2.0


class EmbeddingService:
    def __init__(self) -> None:
        self._client = genai.Client(api_key=settings.gemini_api_key)

    async def embed_texts(self, texts: Sequence[str]) -> list[list[float]]:
        """
        Embed a list of texts and return a list of 768-dim float vectors.
        Processes in batches; retries with exponential backoff on transient errors.
        Order is preserved.
        """
        results: list[list[float]] = []
        for i in range(0, len(texts), _BATCH_SIZE):
            batch = list(texts[i : i + _BATCH_SIZE])
            vectors = await self._embed_batch_with_retry(batch)
            results.extend(vectors)
        return results

    async def embed_text(self, text: str) -> list[float]:
        vectors = await self.embed_texts([text])
        return vectors[0]

    async def _embed_batch_with_retry(self, texts: list[str]) -> list[list[float]]:
        backoff = _INITIAL_BACKOFF
        for attempt in range(1, _MAX_RETRIES + 1):
            try:
                # The SDK call is synchronous, so run it in a worker thread
                # to avoid blocking the event loop.
                response = await asyncio.to_thread(
                    self._client.models.embed_content,
                    model=_EMBED_MODEL,
                    contents=texts,
                    config=genai_types.EmbedContentConfig(
                        task_type="RETRIEVAL_DOCUMENT",
                    ),
                )
                return [list(emb.values) for emb in response.embeddings]
            except Exception as exc:
                if attempt == _MAX_RETRIES:
                    logger.error(f"Embedding batch failed after {_MAX_RETRIES} attempts: {exc}")
                    raise
                logger.warning(f"Embedding attempt {attempt} failed, retrying in {backoff}s: {exc}")
                await asyncio.sleep(backoff)
                backoff *= 2
        raise RuntimeError("unreachable")  # makes type-checker happy


embedding_service = EmbeddingService()
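
The batching and retry behaviour of the service above can be tried without network access or an API key. The following is a minimal sketch, not part of the repository: `embed_in_batches` and `FlakyStub` are hypothetical names, and the stub stands in for the Gemini call by returning a one-element vector holding each text's length. It assumes the same defaults as the module (batches of 100, 3 attempts, backoff doubling from an initial delay).

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable, Sequence

Vector = list[float]
# Hypothetical callable type: takes one batch of texts, returns one vector per text.
EmbedFn = Callable[[list[str]], Awaitable[list[Vector]]]


async def embed_in_batches(
    texts: Sequence[str],
    embed_fn: EmbedFn,
    batch_size: int = 100,
    max_retries: int = 3,
    initial_backoff: float = 2.0,
) -> list[Vector]:
    """Embed texts batch by batch, retrying each failed batch with exponential backoff."""
    results: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start : start + batch_size])
        backoff = initial_backoff
        for attempt in range(1, max_retries + 1):
            try:
                # Batches are appended in input order, so output order matches input.
                results.extend(await embed_fn(batch))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up on this batch and surface the last error
                await asyncio.sleep(backoff)
                backoff *= 2
    return results


class FlakyStub:
    """Stand-in for the real embedding call: fails `failures` times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def __call__(self, batch: list[str]) -> list[Vector]:
        self.calls += 1
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated transient error")
        # Fake "embedding": a single float holding the text length.
        return [[float(len(text))] for text in batch]


if __name__ == "__main__":
    stub = FlakyStub(failures=1)
    terms = [f"term-{i}" for i in range(250)]
    vectors = asyncio.run(embed_in_batches(terms, stub, initial_backoff=0.1))
    print(len(vectors), stub.calls)
```

Like the service, this sketch retries on any exception. A production variant might retry only on errors known to be transient (rate limits, timeouts) so that bad requests fail immediately instead of waiting out the backoff.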