video-accessibility/docs/project/database_schema.md
Vadym Samoilenko 6559ccc1f9 feat(help): in-app role-based help guides + screenshot capture pipeline
- Help.tsx: role tabs, TOC scroll-spy, search, lightbox, react-markdown renderer
- 7 markdown guides (global, client, linguist, reviewer, production, PM, admin)
  with explicit click/drag/keyboard annotations throughout
- Sidebar: Help button added at bottom of nav (all roles)
- App.tsx: /help route, no RoleGate
- frontend/public/help-screenshots/{role}/: directories ready for screenshots
- tools/capture-help-screenshots.ts: Playwright screenshot script
  - Clicks "Local login" toggle before filling credentials
  - Uses test-admin local account (not SSO)
- backend/scripts/seed_test_users.py: idempotent MongoDB seed script
  creates 6 local-auth users (admin + 5 roles) for capture + local dev
- .env.screenshots.example: template with test-admin credentials
- Removes docs/video_accessibility_user_guide_v3.md (superseded by in-app guides)
- Deps: react-markdown, remark-gfm, rehype-raw added to frontend

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 13:08:13 +01:00

602 lines
17 KiB
Markdown

# Database Schema — Accessible Video Processing Platform
<!-- SCOPE: MongoDB collections, field definitions, indexes, and ER diagram. No implementation code. -->
<!-- DOC_KIND: reference -->
<!-- DOC_ROLE: canonical -->
<!-- READ_WHEN: Read when writing a query, adding a migration, or understanding data relationships. -->
<!-- SKIP_WHEN: Skip when you only need API contracts or infrastructure details. -->
<!-- PRIMARY_SOURCES: backend/app/models/*.py, backend/app/migrations/scripts/ -->
**Generated:** 2026-05-01
---
## Quick Navigation
- [Docs Hub](../README.md)
- [Architecture](architecture.md)
- [API Spec](api_spec.md)
## Agent Entry
| Signal | Value |
|--------|-------|
| Purpose | Authoritative reference for all MongoDB collections, fields, and indexes |
| Read When | You need collection structure, field types, or index strategy |
| Skip When | You need API endpoints → api_spec.md; infrastructure → infrastructure.md |
| Canonical | Yes |
| Next Docs | [Architecture](architecture.md), [API Spec](api_spec.md) |
| Primary Sources | `backend/app/models/*.py`, `backend/app/migrations/scripts/` |
---
## Overview
| Detail | Value |
|--------|-------|
| Database | MongoDB Atlas |
| DB name | `accessible_video` (env: `MONGODB_DB`) |
| Driver | Motor (async) via PyMongo |
| Schema style | Schema-on-read; Pydantic models enforce shape in application layer |
---
## ER Diagram
```mermaid
erDiagram
users {
ObjectId _id
string email
string hashed_password
string full_name
string role
string auth_provider
bool is_active
array pm_client_ids
array languages
datetime created_at
datetime updated_at
}
jobs {
ObjectId _id
string org_id
string client_id
string project_id
string created_by_user_id
string status
object source
object requested_outputs
object tts_preferences
map language_outputs
object review
map language_qc
object failure
datetime created_at
datetime updated_at
}
organizations {
ObjectId _id
string name
string slug
bool is_active
string plan
datetime created_at
datetime updated_at
}
memberships {
ObjectId _id
string user_id
string organization_id
string role_in_org
array team_ids
datetime created_at
string created_by
}
clients {
ObjectId _id
string name
string slug
bool is_active
datetime created_at
datetime updated_at
}
teams {
ObjectId _id
string name
string client_id
array member_user_ids
datetime created_at
}
projects {
ObjectId _id
string name
string client_id
bool is_active
array default_languages
string default_linguist_id
string default_reviewer_id
datetime created_at
}
glossaries {
ObjectId _id
string client_id
string name
string source_locale
string status
string current_version_id
datetime created_at
string created_by
}
glossary_versions {
ObjectId _id
string glossary_id
int version_number
int term_count
string embedding_status
datetime created_at
}
glossary_terms {
ObjectId _id
string glossary_id
string version_id
string source_term
map translations
}
job_briefs {
ObjectId _id
string organization_id
string project_id
string title
object requested_outputs
string status
string created_by
string job_id
datetime created_at
datetime updated_at
}
vtt_versions {
ObjectId _id
string job_id
string lang
string kind
int version
string content
string gcs_uri
datetime created_at
object created_by
}
share_tokens {
string _id
string job_id
string organization_id
string created_by_user_id
datetime expires_at
bool is_active
}
invitations {
ObjectId _id
string email
string organization_id
string role_in_org
string token_hash
datetime expires_at
datetime accepted_at
}
review_notes {
ObjectId _id
string job_id
string asset_key
float timestamp_seconds
string content
string user_id
datetime created_at
}
audit_logs {
ObjectId _id
string action
string user_id
string target_type
string target_id
object details
datetime timestamp
}
users ||--o{ memberships : "belongs to org via"
organizations ||--o{ memberships : "has"
users ||--o{ jobs : "creates"
clients ||--o{ jobs : "owns"
projects ||--o{ jobs : "groups"
clients ||--o{ teams : "has"
clients ||--o{ projects : "has"
clients ||--o{ glossaries : "owns"
glossaries ||--o{ glossary_versions : "has"
glossary_versions ||--o{ glossary_terms : "contains"
jobs ||--o{ vtt_versions : "has"
jobs ||--o{ review_notes : "has"
jobs ||--o{ share_tokens : "has"
organizations ||--o{ job_briefs : "submits"
```
---
## Collections
### `users`
| Field | Type | Required | Notes |
|-------|------|----------|-------|
| `_id` | ObjectId | Yes | Primary key |
| `email` | string | Yes | Unique; login identity |
| `hashed_password` | string | No | Null for Microsoft SSO users |
| `full_name` | string | Yes | Display name |
| `role` | string | Yes | Enum: `client`, `linguist`, `reviewer`, `production`, `project_manager`, `admin` |
| `auth_provider` | string | Yes | Enum: `local`, `microsoft` |
| `is_active` | bool | Yes | Soft-disable without deleting |
| `pm_client_ids` | string[] | No | Client IDs where user is Project Manager |
| `languages` | string[] | No | BCP-47 codes (linguist/reviewer competency) |
| `created_at` | datetime | Yes | |
| `updated_at` | datetime | No | |
**Indexes:** `email` (unique), `role`, `is_active`, `created_at` (desc), `auth_provider`
---
### `jobs`
| Field | Type | Required | Notes |
|-------|------|----------|-------|
| `_id` | ObjectId | Yes | Primary key |
| `org_id` | string | No | Owning organisation |
| `client_id` | string | No | Owning client |
| `project_id` | string | No | Owning project |
| `created_by_user_id` | string | Yes | Uploader |
| `status` | string | Yes | See Job Status Enum below |
| `source` | object | Yes | `{filename, gcs_uri, duration_s, language, detected_language}` |
| `requested_outputs` | object | Yes | `{captions_vtt, audio_description_vtt, audio_description_mp3, accessible_video_mp4, languages[], tts_preferences}` |
| `tts_preferences` | object | No | `{provider, default_voice, voices_per_language, model, speed, style_preset}` |
| `language_outputs` | map | No | Key: BCP-47 lang code → `LangOutput` object |
| `review` | object | No | `{notes, reviewer_id, history[]}` |
| `language_qc` | map | No | Key: BCP-47 lang code → `LanguageQCState` |
| `failure` | object | No | `{step, type, message, retriable, occurred_at, retry_count}` |
| `created_at` | datetime | Yes | |
| `updated_at` | datetime | No | |
**`LangOutput` sub-document:**
| Field | Type | Notes |
|-------|------|-------|
| `captions_vtt_gcs` | string | GCS URI |
| `sdh_captions_vtt_gcs` | string | SDH captions (speaker labels, sound effects, music) |
| `ad_vtt_gcs` | string | Audio description VTT |
| `ad_mp3_gcs` | string | Audio description MP3 |
| `accessible_video_gcs` | string | Rendered accessible MP4 |
| `accessible_video_method` | string | `overlay` or `pause_insert` |
| `retimed_captions_vtt_gcs` | string | Re-timed captions for pause-insert |
| `ad_cue_manifest` | array | Per-cue: `{cue_index, gcs_uri, text, duration_s}` |
| `accessible_video_edit_state` | object | Pause points, video segments, TTS regen queue |
| `descriptive_transcript_gcs` | string | WCAG combined transcript |
| `origin` | string | `translate`, `transcreate`, `gemini_translate`, `video_native` |
| `qa_notes` | string | QA reviewer notes |
**`LanguageQCState` sub-document:**
| Field | Type | Notes |
|-------|------|-------|
| `status` | string | `pending`, `in_progress`, `pending_review`, `in_review`, `approved`, `rejected` |
| `assigned_linguist_id` | string | |
| `assigned_linguist_email` | string | |
| `assigned_reviewer_id` | string | |
| `assigned_reviewer_email` | string | |
| `linguist_deadline` | datetime | |
| `reviewer_deadline` | datetime | |
| `total_cues` | int | Set when reviewer opens |
| `reviewed_cues` | int | Incremented as cues are reviewed |
| `reject_category` | string | `timing`, `mistranslation`, `terminology`, `profanity`, `length` |
| `history` | array | `LanguageQCEvent[]` |
| `comments` | array | `LanguageQCComment[]` |
**Job Status Enum:**
| Status | Meaning |
|--------|---------|
| `created` | Job record created, upload pending |
| `ingesting` | Celery worker downloading/preparing video |
| `ai_processing` | Gemini 2.5 Pro generating VTT |
| `pending_qc` | Ready for QC review |
| `approved_english` | English source VTT approved |
| `approved_source` | Non-English source VTT approved |
| `rejected` | Rejected at QC |
| `qc_feedback` | Returned with QC notes |
| `translating` | Google Translate running |
| `tts_generating` | TTS synthesis in progress |
| `tts_failed` | Legacy; use `processing_failed` + `failure.step="tts"` for new failures |
| `rendering_video` | Accessible video render in progress |
| `render_failed` | Legacy failure status |
| `processing_failed` | Unified failure — see `failure.step` |
| `rendering_qc` | Re-rendering during QC |
| `pending_final_review` | PM final sign-off required |
| `completed` | Fully delivered |
| `cancelled` | Manually cancelled |
**Indexes:** `(status, created_at desc)`, `org_id`, `client_id`, `project_id`, `created_by_user_id`
---
### `organizations`
| Field | Type | Required | Notes |
|-------|------|----------|-------|
| `_id` | ObjectId | Yes | |
| `name` | string | Yes | |
| `slug` | string | Yes | URL-safe identifier |
| `is_active` | bool | Yes | |
| `plan` | string | Yes | Default: `standard` |
| `created_at` | datetime | Yes | |
| `updated_at` | datetime | No | |
---
### `memberships`
| Field | Type | Required | Notes |
|-------|------|----------|-------|
| `_id` | ObjectId | Yes | |
| `user_id` | string | Yes | |
| `organization_id` | string | Yes | |
| `role_in_org` | string | Yes | `owner`, `admin`, `manager`, `member`, `viewer` |
| `team_ids` | string[] | No | Teams within the org |
| `created_at` | datetime | No | |
| `created_by` | string | No | |
**Indexes:** `(user_id, organization_id)` unique, `organization_id`, `user_id`
---
### `clients`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `name` | string | |
| `slug` | string | |
| `is_active` | bool | |
| `created_at` | datetime | |
| `updated_at` | datetime | |
---
### `teams`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `name` | string | |
| `client_id` | string | FK → clients |
| `member_user_ids` | string[] | User IDs |
| `created_at` | datetime | |
---
### `projects`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `name` | string | |
| `client_id` | string | FK → clients |
| `is_active` | bool | |
| `default_languages` | string[] | BCP-47 language codes |
| `default_linguist_id` | string | |
| `default_reviewer_id` | string | |
| `created_at` | datetime | |
| `updated_at` | datetime | |
---
### `glossaries`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `client_id` | string | FK → clients |
| `name` | string | |
| `description` | string | Optional |
| `source_locale` | string | BCP-47, e.g. `en-GB` |
| `source` | string | `xlsx_upload` or `fraze_api` |
| `status` | string | `active`, `archived` |
| `current_version_id` | string | FK → glossary_versions |
| `created_at` | datetime | |
| `created_by` | string | user_id |
---
### `glossary_versions`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `glossary_id` | string | FK → glossaries |
| `version_number` | int | Monotonically increasing |
| `source_xlsx_gcs_path` | string | Original XLSX on GCS |
| `term_count` | int | |
| `embedded_count` | int | Terms with embeddings |
| `embedding_status` | string | `pending`, `in_progress`, `done`, `failed` |
| `created_at` | datetime | |
| `created_by` | string | user_id |
| `change_note` | string | Optional version note |
---
### `glossary_terms`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `glossary_id` | string | FK → glossaries |
| `version_id` | string | FK → glossary_versions |
| source term | string | Source language term |
| translations | map | locale → translated string |
---
### `job_briefs`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `organization_id` | string | FK → organizations |
| `project_id` | string | Optional FK → projects |
| `title` | string | |
| `description` | string | Optional |
| `requested_outputs` | object | Same shape as `jobs.requested_outputs` |
| `languages` | string[] | Requested target languages |
| `deadline` | datetime | Optional |
| `status` | string | `draft`, `submitted`, `approved`, `rejected`, `fulfilled` |
| `created_by` | string | user_id |
| `job_id` | string | FK → jobs (set when brief is fulfilled) |
| `created_at` | datetime | |
| `updated_at` | datetime | |
| `submitted_at` | datetime | |
| `approved_by` | string | PM user_id |
| `reject_reason` | string | |
---
### `vtt_versions`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `job_id` | string | FK → jobs |
| `lang` | string | BCP-47 language code |
| `kind` | string | `captions` or `ad` |
| `version` | int | Monotonically increasing per (job_id, lang, kind) |
| `content` | string | Full VTT file content |
| `gcs_uri` | string | Backup copy on GCS |
| `created_at` | datetime | |
| `created_by` | object | `{user_id, user_email}` |
| `note` | string | Editor note |
| `parent_version` | int | Version this was derived from |
| `cue_count` | int | |
| `byte_size` | int | |
---
### `share_tokens`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | string | Token itself (32 hex chars) — used as PK |
| `job_id` | string | FK → jobs |
| `organization_id` | string | FK → organizations |
| `created_by_user_id` | string | |
| `created_by_email` | string | Denormalised |
| `created_at` | datetime | |
| `expires_at` | datetime | Null = never expires |
| `is_active` | bool | |
| `label` | string | Human note |
---
### `invitations`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `email` | string | Invitee email |
| `organization_id` | string | FK → organizations |
| `role_in_org` | string | OrgRole enum |
| `target_team_ids` | string[] | Teams to join on accept |
| `token_hash` | string | SHA-256 hash of invitation token |
| `invited_by_user_id` | string | |
| `expires_at` | datetime | |
| `accepted_at` | datetime | Null if pending |
| `revoked_at` | datetime | Null if active |
| `created_at` | datetime | |
**Indexes:** `token_hash`, `organization_id`, `email`, TTL on `expires_at`
---
### `review_notes`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `job_id` | string | FK → jobs |
| `asset_key` | string | e.g. `en`, `es`, `en_accessible` |
| `timestamp_seconds` | float | Video timestamp when note was placed |
| `content` | string | Note text |
| `user_id` | string | Author |
| `user_name` | string | Denormalised display name |
| `created_at` | datetime | |
| `updated_at` | datetime | |
---
### `audit_logs`
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `action` | string | Dotted enum, e.g. `auth.login.success`, `job.approve` |
| `user_id` | string | Actor |
| `target_type` | string | e.g. `job`, `user` |
| `target_id` | string | Target resource ID |
| `details` | object | Freeform context |
| `timestamp` | datetime | |
**Indexes:** `timestamp` (desc), `user_id`, `action`, `(target_type, target_id)`
---
### `migrations`
Internal collection managed by `backend/app/migrations/migrator.py`.
| Field | Type | Notes |
|-------|------|-------|
| `_id` | ObjectId | |
| `version` | string | Migration script name/timestamp |
| `applied_at` | datetime | |
**Indexes:** `version` (unique), `applied_at` (desc)
---
## GCS File Layout
All binary assets are stored in GCS; MongoDB stores only GCS URIs.
```text
gs://{GCS_BUCKET}/jobs/{job_id}/
source.mp4
{lang}/
captions.vtt
ad.vtt
ad.mp3
accessible.mp4
sdh_captions.vtt
descriptive_transcript.txt
cues/
cue_{n}.mp3
```
---
## Maintenance
**Last Updated:** 2026-05-01
**Update Triggers:**
- New model file added in `backend/app/models/`
- New migration script that adds fields or collections
- Field added/removed from existing Pydantic model
- New index created in a migration
**Verification:**
- [ ] All files in `backend/app/models/` represented
- [ ] All indexes from `backend/app/migrations/scripts/` reflected
- [ ] Job status enum matches `JobStatus` in `models/job.py`
- [ ] ER diagram relationships accurate