VTT Edit → Descriptive Transcript Regeneration

Problem

When a reviewer edits captions.vtt or ad.vtt via PATCH /jobs/{id}/vtt, the descriptive transcript (descriptive_transcript.txt) becomes stale: it still reflects the pre-edit VTT content. Nothing flags the mismatch, because the PATCH handler saves the VTT without regenerating the transcript.

Pattern

In the PATCH handler, after writing the edited VTT(s) to GCS but before the MongoDB update:

Determine which stream was edited (request body) and which was not
Read the unchanged stream from GCS
Merge both streams via generate_descriptive_transcript(captions_text, ad_text)
Upload the new transcript to GCS
Update lang_output["descriptive_transcript_gcs"] so the MongoDB doc points to the fresh file

Wrap the whole sequence in a broad try/except so that a transcript failure is logged but never blocks the VTT save.

# After GCS uploads for captions/AD:
if request.captions_vtt or request.audio_description_vtt:
    try:
        # Local import, kept inside the try so an import failure is only logged.
        from ...services.descriptive_transcript import (
            generate_descriptive_transcript as _gen_transcript,
        )
        # We are inside a coroutine, so get_running_loop() is the safe call.
        loop = asyncio.get_running_loop()
        gcs_prefix = f"gs://{settings.gcs_bucket}/"
        # Use the edited captions if supplied; otherwise read the stored copy.
        captions_text = request.captions_vtt
        if not captions_text:
            cc_gcs = lang_output.get("captions_vtt_gcs")
            if cc_gcs:
                _blob = gcs_service.bucket.blob(cc_gcs.replace(gcs_prefix, ""))
                # Blocking SDK call: run it on the GCS service's thread pool.
                captions_text = await loop.run_in_executor(
                    gcs_service.executor, _blob.download_as_text
                )
        # Same fallback for the audio description stream.
        ad_text = request.audio_description_vtt
        if not ad_text:
            ad_gcs = lang_output.get("ad_vtt_gcs")
            if ad_gcs:
                _blob = gcs_service.bucket.blob(ad_gcs.replace(gcs_prefix, ""))
                ad_text = await loop.run_in_executor(
                    gcs_service.executor, _blob.download_as_text
                )
        # Merge both streams; a missing stream is passed as an empty string.
        transcript_text = _gen_transcript(captions_text or "", ad_text or "")
        if transcript_text:
            transcript_uri = await upload_vtt_to_gcs(
                transcript_text,
                f"{job_id}/{target_language}/descriptive_transcript.txt",
            )
            # Point lang_output at the fresh file before the MongoDB update.
            lang_output["descriptive_transcript_gcs"] = transcript_uri
    except Exception as _tr_err:
        # Never let a transcript failure block the VTT save.
        logger.warning(
            "Failed to regenerate descriptive transcript for job %s: %s",
            job_id,
            _tr_err,
        )
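
The two fallback reads in the handler are nearly identical, so they could be folded into one helper. The following is a minimal sketch of that idea, not the project's actual code: resolve_stream, blob_name_from_uri, and the download callable are hypothetical names introduced here, and the download callable stands in for whatever blocking GCS read the service performs. It assumes Python 3.9+ for str.removeprefix.

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Optional


def blob_name_from_uri(uri: str, bucket: str) -> str:
    """Strip a leading gs://<bucket>/ prefix; other strings are returned unchanged."""
    return uri.removeprefix(f"gs://{bucket}/")


async def resolve_stream(
    edited: Optional[str],
    stored_uri: Optional[str],
    download: Callable[[str], str],  # hypothetical blocking reader, e.g. a GCS download
    executor: Optional[Executor] = None,
) -> str:
    """Return the edited text if present, else the stored copy, else an empty string."""
    if edited:
        return edited
    if not stored_uri:
        return ""
    # Run the blocking read on a worker thread so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, download, stored_uri)


if __name__ == "__main__":
    # In-memory stand-in for the bucket, for illustration only.
    fake_store = {"gs://demo-bucket/job1/en/ad.vtt": "WEBVTT\n\nstored AD"}
    with ThreadPoolExecutor(max_workers=1) as pool:
        text = asyncio.run(
            resolve_stream(None, "gs://demo-bucket/job1/en/ad.vtt", fake_store.__getitem__, pool)
        )
    print(text)
```

With a helper like this, the handler would call it once per stream and pass both results to generate_descriptive_transcript, while keeping the surrounding try/except unchanged.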

Notes

loop.run_in_executor(gcs_service.executor, blob.download_as_text): run the blocking GCS SDK download on the GCS service's thread pool executor so it stays off the event loop
The local import defers loading until the handler runs, which avoids circular imports at module load time; because it sits inside the try block, an ImportError (for example, when the module is absent from a deployment) is logged rather than raised
Always update the GCS pointer in lang_output before the MongoDB update. MongoDB writes are atomic at the document level, so the VTT path and the transcript path are persisted together
This pattern applies to any derived artifact that depends on two source VTT files
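
As a rough illustration of how the pattern might generalize, the sketch below isolates the "build, upload, update pointer, never raise" sequence in one coroutine. All names here (regenerate_derived, pointer_key, build, upload) are hypothetical and not part of the codebase described above; upload stands in for a function such as upload_vtt_to_gcs that returns the new URI.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict

logger = logging.getLogger(__name__)


async def regenerate_derived(
    lang_output: Dict[str, str],
    pointer_key: str,
    build: Callable[[], str],                  # hypothetical: merges the source texts
    upload: Callable[[str], Awaitable[str]],   # hypothetical: stores text, returns its URI
) -> bool:
    """Rebuild one derived artifact and update its pointer in lang_output.

    Returns True when the pointer was updated. Any failure is logged and
    swallowed so the caller's primary save can proceed.
    """
    try:
        text = build()
        if not text:
            return False
        # Assign only after the upload succeeds, so a failure leaves the old pointer intact.
        lang_output[pointer_key] = await upload(text)
        return True
    except Exception:
        logger.warning("Failed to regenerate %s", pointer_key, exc_info=True)
        return False


async def _demo() -> None:
    async def fake_upload(text: str) -> str:
        # Stand-in for a real upload; the URI is made up for the example.
        return "gs://demo-bucket/job1/en/descriptive_transcript.txt"

    lang_output = {"captions_vtt_gcs": "gs://demo-bucket/job1/en/captions.vtt"}
    updated = await regenerate_derived(
        lang_output, "descriptive_transcript_gcs", lambda: "merged text", fake_upload
    )
    print(updated, lang_output)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Because the pointer is written into lang_output before the caller issues its single MongoDB update, the source paths and the derived path still land in the same document write.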
