# Database Schema — Accessible Video Processing Platform **Generated:** 2026-05-01 --- ## Quick Navigation - [Docs Hub](../README.md) - [Architecture](architecture.md) - [API Spec](api_spec.md) ## Agent Entry | Signal | Value | |--------|-------| | Purpose | Authoritative reference for all MongoDB collections, fields, and indexes | | Read When | You need collection structure, field types, or index strategy | | Skip When | You need API endpoints → api_spec.md; infrastructure → infrastructure.md | | Canonical | Yes | | Next Docs | [Architecture](architecture.md), [API Spec](api_spec.md) | | Primary Sources | `backend/app/models/*.py`, `backend/app/migrations/scripts/` | --- ## Overview | Detail | Value | |--------|-------| | Database | MongoDB Atlas | | DB name | `accessible_video` (env: `MONGODB_DB`) | | Driver | Motor (async) via PyMongo | | Schema style | Schema-on-read; Pydantic models enforce shape in application layer | --- ## ER Diagram ```mermaid erDiagram users { ObjectId _id string email string hashed_password string full_name string role string auth_provider bool is_active array pm_client_ids array languages datetime created_at datetime updated_at } jobs { ObjectId _id string org_id string client_id string project_id string created_by_user_id string status object source object requested_outputs object tts_preferences map language_outputs object review map language_qc object failure datetime created_at datetime updated_at } organizations { ObjectId _id string name string slug bool is_active string plan datetime created_at datetime updated_at } memberships { ObjectId _id string user_id string organization_id string role_in_org array team_ids datetime created_at string created_by } clients { ObjectId _id string name string slug bool is_active datetime created_at datetime updated_at } teams { ObjectId _id string name string client_id array member_user_ids datetime created_at } projects { ObjectId _id string name string client_id bool is_active array default_languages string default_linguist_id string default_reviewer_id datetime created_at } glossaries { ObjectId _id string client_id string name string source_locale string status string current_version_id datetime created_at string created_by } glossary_versions { ObjectId _id string glossary_id int version_number int term_count string embedding_status datetime created_at } glossary_terms { ObjectId _id string glossary_id string version_id string source_term map translations } job_briefs { ObjectId _id string organization_id string project_id string title object requested_outputs string status string created_by string job_id datetime created_at datetime updated_at } vtt_versions { ObjectId _id string job_id string lang string kind int version string content string gcs_uri datetime created_at object created_by } share_tokens { string _id string job_id string organization_id string created_by_user_id datetime expires_at bool is_active } invitations { ObjectId _id string email string organization_id string role_in_org string token_hash datetime expires_at datetime accepted_at } review_notes { ObjectId _id string job_id string asset_key float timestamp_seconds string content string user_id datetime created_at } audit_logs { ObjectId _id string action string user_id string target_type string target_id object details datetime timestamp } users ||--o{ memberships : "belongs to org via" organizations ||--o{ memberships : "has" users ||--o{ jobs : "creates" clients ||--o{ jobs : "owns" projects ||--o{ jobs : "groups" clients ||--o{ teams : "has" clients ||--o{ projects : "has" clients ||--o{ glossaries : "owns" glossaries ||--o{ glossary_versions : "has" glossary_versions ||--o{ glossary_terms : "contains" jobs ||--o{ vtt_versions : "has" jobs ||--o{ review_notes : "has" jobs ||--o{ share_tokens : "has" organizations ||--o{ job_briefs : "submits" ``` --- ## Collections ### `users` | Field | Type | Required | Notes | |-------|------|----------|-------| | `_id` | ObjectId | Yes | Primary key | | `email` | string | Yes | Unique; login identity | | `hashed_password` | string | No | Null for Microsoft SSO users | | `full_name` | string | Yes | Display name | | `role` | string | Yes | Enum: `client`, `linguist`, `reviewer`, `production`, `project_manager`, `admin` | | `auth_provider` | string | Yes | Enum: `local`, `microsoft` | | `is_active` | bool | Yes | Soft-disable without deleting | | `pm_client_ids` | string[] | No | Client IDs where user is Project Manager | | `languages` | string[] | No | BCP-47 codes (linguist/reviewer competency) | | `created_at` | datetime | Yes | | | `updated_at` | datetime | No | | **Indexes:** `email` (unique), `role`, `is_active`, `created_at` (desc), `auth_provider` --- ### `jobs` | Field | Type | Required | Notes | |-------|------|----------|-------| | `_id` | ObjectId | Yes | Primary key | | `org_id` | string | No | Owning organisation | | `client_id` | string | No | Owning client | | `project_id` | string | No | Owning project | | `created_by_user_id` | string | Yes | Uploader | | `status` | string | Yes | See Job Status Enum below | | `source` | object | Yes | `{filename, gcs_uri, duration_s, language, detected_language}` | | `requested_outputs` | object | Yes | `{captions_vtt, audio_description_vtt, audio_description_mp3, accessible_video_mp4, languages[], tts_preferences}` | | `tts_preferences` | object | No | `{provider, default_voice, voices_per_language, model, speed, style_preset}` | | `language_outputs` | map | No | Key: BCP-47 lang code → `LangOutput` object | | `review` | object | No | `{notes, reviewer_id, history[]}` | | `language_qc` | map | No | Key: BCP-47 lang code → `LanguageQCState` | | `failure` | object | No | `{step, type, message, retriable, occurred_at, retry_count}` | | `created_at` | datetime | Yes | | | `updated_at` | datetime | No | | **`LangOutput` sub-document:** | Field | Type | Notes | |-------|------|-------| | `captions_vtt_gcs` | string | GCS URI | | `sdh_captions_vtt_gcs` | string | SDH captions (speaker labels, sound effects, music) | | `ad_vtt_gcs` | string | Audio description VTT | | `ad_mp3_gcs` | string | Audio description MP3 | | `accessible_video_gcs` | string | Rendered accessible MP4 | | `accessible_video_method` | string | `overlay` or `pause_insert` | | `retimed_captions_vtt_gcs` | string | Re-timed captions for pause-insert | | `ad_cue_manifest` | array | Per-cue: `{cue_index, gcs_uri, text, duration_s}` | | `accessible_video_edit_state` | object | Pause points, video segments, TTS regen queue | | `descriptive_transcript_gcs` | string | WCAG combined transcript | | `origin` | string | `translate`, `transcreate`, `gemini_translate`, `video_native` | | `qa_notes` | string | QA reviewer notes | **`LanguageQCState` sub-document:** | Field | Type | Notes | |-------|------|-------| | `status` | string | `pending`, `in_progress`, `pending_review`, `in_review`, `approved`, `rejected` | | `assigned_linguist_id` | string | | | `assigned_linguist_email` | string | | | `assigned_reviewer_id` | string | | | `assigned_reviewer_email` | string | | | `linguist_deadline` | datetime | | | `reviewer_deadline` | datetime | | | `total_cues` | int | Set when reviewer opens | | `reviewed_cues` | int | Incremented as cues are reviewed | | `reject_category` | string | `timing`, `mistranslation`, `terminology`, `profanity`, `length` | | `history` | array | `LanguageQCEvent[]` | | `comments` | array | `LanguageQCComment[]` | **Job Status Enum:** | Status | Meaning | |--------|---------| | `created` | Job record created, upload pending | | `ingesting` | Celery worker downloading/preparing video | | `ai_processing` | Gemini 2.5 Pro generating VTT | | `pending_qc` | Ready for QC review | | `approved_english` | English source VTT approved | | `approved_source` | Non-English source VTT approved | | `rejected` | Rejected at QC | | `qc_feedback` | Returned with QC notes | | `translating` | Google Translate running | | `tts_generating` | TTS synthesis in progress | | `tts_failed` | Legacy; use `processing_failed` + `failure.step="tts"` for new failures | | `rendering_video` | Accessible video render in progress | | `render_failed` | Legacy failure status | | `processing_failed` | Unified failure — see `failure.step` | | `rendering_qc` | Re-rendering during QC | | `pending_final_review` | PM final sign-off required | | `completed` | Fully delivered | | `cancelled` | Manually cancelled | **Indexes:** `(status, created_at desc)`, `org_id`, `client_id`, `project_id`, `created_by_user_id` --- ### `organizations` | Field | Type | Required | Notes | |-------|------|----------|-------| | `_id` | ObjectId | Yes | | | `name` | string | Yes | | | `slug` | string | Yes | URL-safe identifier | | `is_active` | bool | Yes | | | `plan` | string | Yes | Default: `standard` | | `created_at` | datetime | Yes | | | `updated_at` | datetime | No | | --- ### `memberships` | Field | Type | Required | Notes | |-------|------|----------|-------| | `_id` | ObjectId | Yes | | | `user_id` | string | Yes | | | `organization_id` | string | Yes | | | `role_in_org` | string | Yes | `owner`, `admin`, `manager`, `member`, `viewer` | | `team_ids` | string[] | No | Teams within the org | | `created_at` | datetime | No | | | `created_by` | string | No | | **Indexes:** `(user_id, organization_id)` unique, `organization_id`, `user_id` --- ### `clients` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `name` | string | | | `slug` | string | | | `is_active` | bool | | | `created_at` | datetime | | | `updated_at` | datetime | | --- ### `teams` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `name` | string | | | `client_id` | string | FK → clients | | `member_user_ids` | string[] | User IDs | | `created_at` | datetime | | --- ### `projects` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `name` | string | | | `client_id` | string | FK → clients | | `is_active` | bool | | | `default_languages` | string[] | BCP-47 language codes | | `default_linguist_id` | string | | | `default_reviewer_id` | string | | | `created_at` | datetime | | | `updated_at` | datetime | | --- ### `glossaries` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `client_id` | string | FK → clients | | `name` | string | | | `description` | string | Optional | | `source_locale` | string | BCP-47, e.g. `en-GB` | | `source` | string | `xlsx_upload` or `fraze_api` | | `status` | string | `active`, `archived` | | `current_version_id` | string | FK → glossary_versions | | `created_at` | datetime | | | `created_by` | string | user_id | --- ### `glossary_versions` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `glossary_id` | string | FK → glossaries | | `version_number` | int | Monotonically increasing | | `source_xlsx_gcs_path` | string | Original XLSX on GCS | | `term_count` | int | | | `embedded_count` | int | Terms with embeddings | | `embedding_status` | string | `pending`, `in_progress`, `done`, `failed` | | `created_at` | datetime | | | `created_by` | string | user_id | | `change_note` | string | Optional version note | --- ### `glossary_terms` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `glossary_id` | string | FK → glossaries | | `version_id` | string | FK → glossary_versions | | source term | string | Source language term | | translations | map | locale → translated string | --- ### `job_briefs` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `organization_id` | string | FK → organizations | | `project_id` | string | Optional FK → projects | | `title` | string | | | `description` | string | Optional | | `requested_outputs` | object | Same shape as `jobs.requested_outputs` | | `languages` | string[] | Requested target languages | | `deadline` | datetime | Optional | | `status` | string | `draft`, `submitted`, `approved`, `rejected`, `fulfilled` | | `created_by` | string | user_id | | `job_id` | string | FK → jobs (set when brief is fulfilled) | | `created_at` | datetime | | | `updated_at` | datetime | | | `submitted_at` | datetime | | | `approved_by` | string | PM user_id | | `reject_reason` | string | | --- ### `vtt_versions` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `job_id` | string | FK → jobs | | `lang` | string | BCP-47 language code | | `kind` | string | `captions` or `ad` | | `version` | int | Monotonically increasing per (job_id, lang, kind) | | `content` | string | Full VTT file content | | `gcs_uri` | string | Backup copy on GCS | | `created_at` | datetime | | | `created_by` | object | `{user_id, user_email}` | | `note` | string | Editor note | | `parent_version` | int | Version this was derived from | | `cue_count` | int | | | `byte_size` | int | | --- ### `share_tokens` | Field | Type | Notes | |-------|------|-------| | `_id` | string | Token itself (32 hex chars) — used as PK | | `job_id` | string | FK → jobs | | `organization_id` | string | FK → organizations | | `created_by_user_id` | string | | | `created_by_email` | string | Denormalised | | `created_at` | datetime | | | `expires_at` | datetime | Null = never expires | | `is_active` | bool | | | `label` | string | Human note | --- ### `invitations` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `email` | string | Invitee email | | `organization_id` | string | FK → organizations | | `role_in_org` | string | OrgRole enum | | `target_team_ids` | string[] | Teams to join on accept | | `token_hash` | string | SHA-256 hash of invitation token | | `invited_by_user_id` | string | | | `expires_at` | datetime | | | `accepted_at` | datetime | Null if pending | | `revoked_at` | datetime | Null if active | | `created_at` | datetime | | **Indexes:** `token_hash`, `organization_id`, `email`, TTL on `expires_at` --- ### `review_notes` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `job_id` | string | FK → jobs | | `asset_key` | string | e.g. `en`, `es`, `en_accessible` | | `timestamp_seconds` | float | Video timestamp when note was placed | | `content` | string | Note text | | `user_id` | string | Author | | `user_name` | string | Denormalised display name | | `created_at` | datetime | | | `updated_at` | datetime | | --- ### `audit_logs` | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `action` | string | Dotted enum, e.g. `auth.login.success`, `job.approve` | | `user_id` | string | Actor | | `target_type` | string | e.g. `job`, `user` | | `target_id` | string | Target resource ID | | `details` | object | Freeform context | | `timestamp` | datetime | | **Indexes:** `timestamp` (desc), `user_id`, `action`, `(target_type, target_id)` --- ### `migrations` Internal collection managed by `backend/app/migrations/migrator.py`. | Field | Type | Notes | |-------|------|-------| | `_id` | ObjectId | | | `version` | string | Migration script name/timestamp | | `applied_at` | datetime | | **Indexes:** `version` (unique), `applied_at` (desc) --- ## GCS File Layout All binary assets are stored in GCS; MongoDB stores only GCS URIs. ```text gs://{GCS_BUCKET}/jobs/{job_id}/ source.mp4 {lang}/ captions.vtt ad.vtt ad.mp3 accessible.mp4 sdh_captions.vtt descriptive_transcript.txt cues/ cue_{n}.mp3 ``` --- ## Maintenance **Last Updated:** 2026-05-01 **Update Triggers:** - New model file added in `backend/app/models/` - New migration script that adds fields or collections - Field added/removed from existing Pydantic model - New index created in a migration **Verification:** - [ ] All files in `backend/app/models/` represented - [ ] All indexes from `backend/app/migrations/scripts/` reflected - [ ] Job status enum matches `JobStatus` in `models/job.py` - [ ] ER diagram relationships accurate