| title | description | tags | created | updated | projects |
|---|---|---|---|---|---|
| VTT Edit → Descriptive Transcript Regeneration | Pattern for keeping descriptive_transcript.txt in sync when captions or audio description VTTs are edited via PATCH /vtt | | 2026-05-01 | 2026-05-01 | |
VTT Edit → Descriptive Transcript Regeneration
Problem
When a reviewer edits either `captions.vtt` or `ad.vtt` via `PATCH /jobs/{id}/vtt`, the descriptive transcript (`descriptive_transcript.txt`) becomes stale: it still reflects the pre-edit VTT content. The staleness goes undetected because the PATCH handler saves the VTT without regenerating the transcript.
Pattern
In the PATCH handler, after writing the edited VTT(s) to GCS but before the MongoDB update:
- Determine which stream was edited (request body) and which was not
- Read the unchanged stream from GCS
- Merge both streams via `generate_descriptive_transcript(captions_text, ad_text)`
- Upload the new transcript to GCS
- Update `lang_output["descriptive_transcript_gcs"]` so the MongoDB doc points to the fresh file
Wrap in a broad except so a transcript failure never blocks the VTT save.
```python
# After GCS uploads for captions/AD:
if request.captions_vtt or request.audio_description_vtt:
    try:
        # Local import: see Notes.
        from ...services.descriptive_transcript import (
            generate_descriptive_transcript as _gen_transcript,
        )

        # Use the edited captions from the request; otherwise read the stored copy.
        captions_text = request.captions_vtt
        if not captions_text:
            cc_gcs = lang_output.get("captions_vtt_gcs")
            if cc_gcs:
                _blob = gcs_service.bucket.blob(
                    cc_gcs.replace(f"gs://{settings.gcs_bucket}/", "")
                )
                captions_text = await asyncio.get_event_loop().run_in_executor(
                    gcs_service.executor, _blob.download_as_text
                )

        # Same fallback for the audio description stream.
        ad_text = request.audio_description_vtt
        if not ad_text:
            ad_gcs = lang_output.get("ad_vtt_gcs")
            if ad_gcs:
                _blob = gcs_service.bucket.blob(
                    ad_gcs.replace(f"gs://{settings.gcs_bucket}/", "")
                )
                ad_text = await asyncio.get_event_loop().run_in_executor(
                    gcs_service.executor, _blob.download_as_text
                )

        # Merge both streams and point lang_output at the fresh transcript.
        transcript_text = _gen_transcript(captions_text or "", ad_text or "")
        if transcript_text:
            transcript_uri = await upload_vtt_to_gcs(
                transcript_text,
                f"{job_id}/{target_language}/descriptive_transcript.txt",
            )
            lang_output["descriptive_transcript_gcs"] = transcript_uri
    except Exception as _tr_err:
        # Broad on purpose: a transcript failure must never block the VTT save.
        logger.warning(
            f"Failed to regenerate descriptive transcript for job {job_id}: {_tr_err}"
        )
```
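The two download branches in the handler differ only in which GCS URI they read, so they could plausibly be folded into a small helper. The following is a minimal sketch, not the project's actual code: `gcs_uri_to_blob_name`, `download_text_off_loop`, and `FakeBlob` are hypothetical names introduced here for illustration, and the sketch uses `asyncio.get_running_loop()` and `str.removeprefix` (Python 3.9+) in place of the calls shown in the handler.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def gcs_uri_to_blob_name(uri: str, bucket_name: str) -> str:
    """Strip the gs://<bucket>/ prefix so the remainder can be used as a blob name."""
    return uri.removeprefix(f"gs://{bucket_name}/")


async def download_text_off_loop(blob, executor: ThreadPoolExecutor) -> str:
    """Run the blocking download in the thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blob.download_as_text)


class FakeBlob:
    """Hypothetical stand-in for a GCS blob; only download_as_text is modelled."""

    def __init__(self, text: str) -> None:
        self._text = text

    def download_as_text(self) -> str:
        return self._text


async def demo() -> str:
    # In the real handler the executor would be gcs_service.executor.
    with ThreadPoolExecutor(max_workers=2) as executor:
        return await download_text_off_loop(FakeBlob("WEBVTT\n\n"), executor)
```

With a helper along these lines, each fallback branch in the handler would reduce to building the blob from `gcs_uri_to_blob_name(...)` and awaiting `download_text_off_loop(...)`.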
Notes
- `asyncio.get_event_loop().run_in_executor(gcs_service.executor, blob.download_as_text)`: use the GCS service's thread pool executor to keep blocking GCS SDK calls off the async event loop
- The local import inside try/except avoids circular imports, and if the module is only conditionally present, the resulting import failure is caught and logged instead of breaking the VTT save
- Always update the GCS pointer in `lang_output` before the MongoDB update: the write is atomic at the document level, so the VTT path and the transcript path update together
- This pattern applies to any derived artifact that depends on two source VTT files
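
Since the pattern generalizes to other derived artifacts, its shape can be sketched independently of GCS and MongoDB. This is a simplified, synchronous illustration under stated assumptions: `regenerate_derived` and all of its parameters are hypothetical names, with the loaders, merge function, and store function passed in as callables rather than taken from any real service.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def regenerate_derived(
    edited_a: Optional[str],
    edited_b: Optional[str],
    load_a: Callable[[], Optional[str]],
    load_b: Callable[[], Optional[str]],
    merge: Callable[[str, str], str],
    store: Callable[[str], str],
) -> Optional[str]:
    """Return the new artifact URI, or None if nothing was produced or a step failed."""
    try:
        # Prefer the edited text from the request; fall back to the stored copy.
        text_a = edited_a or load_a()
        text_b = edited_b or load_b()
        merged = merge(text_a or "", text_b or "")
        # Only store (and return a pointer) when the merge produced content.
        return store(merged) if merged else None
    except Exception as err:  # broad on purpose: the primary save must not fail
        logger.warning("Failed to regenerate derived artifact: %s", err)
        return None
```

A caller would update its pointer field only when the return value is not `None`, and then perform the single document update, so a failed regeneration leaves the previous pointer in place.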