video-accessibility/docs/project/database_schema.md
Vadym Samoilenko 6559ccc1f9 feat(help): in-app role-based help guides + screenshot capture pipeline
- Help.tsx: role tabs, TOC scroll-spy, search, lightbox, react-markdown renderer
- 7 markdown guides (global, client, linguist, reviewer, production, PM, admin)
  with explicit click/drag/keyboard annotations throughout
- Sidebar: Help button added at bottom of nav (all roles)
- App.tsx: /help route, no RoleGate
- frontend/public/help-screenshots/{role}/: directories ready for screenshots
- tools/capture-help-screenshots.ts: Playwright screenshot script
  - Clicks "Local login" toggle before filling credentials
  - Uses test-admin local account (not SSO)
- backend/scripts/seed_test_users.py: idempotent MongoDB seed script
  creates 6 local-auth users (admin + 5 roles) for capture + local dev
- .env.screenshots.example: template with test-admin credentials
- Removes docs/video_accessibility_user_guide_v3.md (superseded by in-app guides)
- Deps: react-markdown, remark-gfm, rehype-raw added to frontend

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 13:08:13 +01:00

17 KiB

Database Schema — Accessible Video Processing Platform

Generated: 2026-05-01


Quick Navigation

Agent Entry

Signal Value
Purpose Authoritative reference for all MongoDB collections, fields, and indexes
Read When You need collection structure, field types, or index strategy
Skip When You need API endpoints → api_spec.md; infrastructure → infrastructure.md
Canonical Yes
Next Docs Architecture, API Spec
Primary Sources backend/app/models/*.py, backend/app/migrations/scripts/

Overview

Detail Value
Database MongoDB Atlas
DB name accessible_video (env: MONGODB_DB)
Driver Motor (async) via PyMongo
Schema style Schema-on-read; Pydantic models enforce shape in application layer

ER Diagram

erDiagram
    users {
        ObjectId _id
        string email
        string hashed_password
        string full_name
        string role
        string auth_provider
        bool is_active
        array pm_client_ids
        array languages
        datetime created_at
        datetime updated_at
    }
    jobs {
        ObjectId _id
        string org_id
        string client_id
        string project_id
        string created_by_user_id
        string status
        object source
        object requested_outputs
        object tts_preferences
        map language_outputs
        object review
        map language_qc
        object failure
        datetime created_at
        datetime updated_at
    }
    organizations {
        ObjectId _id
        string name
        string slug
        bool is_active
        string plan
        datetime created_at
        datetime updated_at
    }
    memberships {
        ObjectId _id
        string user_id
        string organization_id
        string role_in_org
        array team_ids
        datetime created_at
        string created_by
    }
    clients {
        ObjectId _id
        string name
        string slug
        bool is_active
        datetime created_at
        datetime updated_at
    }
    teams {
        ObjectId _id
        string name
        string client_id
        array member_user_ids
        datetime created_at
    }
    projects {
        ObjectId _id
        string name
        string client_id
        bool is_active
        array default_languages
        string default_linguist_id
        string default_reviewer_id
        datetime created_at
    }
    glossaries {
        ObjectId _id
        string client_id
        string name
        string source_locale
        string status
        string current_version_id
        datetime created_at
        string created_by
    }
    glossary_versions {
        ObjectId _id
        string glossary_id
        int version_number
        int term_count
        string embedding_status
        datetime created_at
    }
    glossary_terms {
        ObjectId _id
        string glossary_id
        string version_id
        string source_term
        map translations
    }
    job_briefs {
        ObjectId _id
        string organization_id
        string project_id
        string title
        object requested_outputs
        string status
        string created_by
        string job_id
        datetime created_at
        datetime updated_at
    }
    vtt_versions {
        ObjectId _id
        string job_id
        string lang
        string kind
        int version
        string content
        string gcs_uri
        datetime created_at
        object created_by
    }
    share_tokens {
        string _id
        string job_id
        string organization_id
        string created_by_user_id
        datetime expires_at
        bool is_active
    }
    invitations {
        ObjectId _id
        string email
        string organization_id
        string role_in_org
        string token_hash
        datetime expires_at
        datetime accepted_at
    }
    review_notes {
        ObjectId _id
        string job_id
        string asset_key
        float timestamp_seconds
        string content
        string user_id
        datetime created_at
    }
    audit_logs {
        ObjectId _id
        string action
        string user_id
        string target_type
        string target_id
        object details
        datetime timestamp
    }

    users ||--o{ memberships : "belongs to org via"
    organizations ||--o{ memberships : "has"
    users ||--o{ jobs : "creates"
    clients ||--o{ jobs : "owns"
    projects ||--o{ jobs : "groups"
    clients ||--o{ teams : "has"
    clients ||--o{ projects : "has"
    clients ||--o{ glossaries : "owns"
    glossaries ||--o{ glossary_versions : "has"
    glossary_versions ||--o{ glossary_terms : "contains"
    jobs ||--o{ vtt_versions : "has"
    jobs ||--o{ review_notes : "has"
    jobs ||--o{ share_tokens : "has"
    organizations ||--o{ job_briefs : "submits"

Collections

users

Field Type Required Notes
_id ObjectId Yes Primary key
email string Yes Unique; login identity
hashed_password string No Null for Microsoft SSO users
full_name string Yes Display name
role string Yes Enum: client, linguist, reviewer, production, project_manager, admin
auth_provider string Yes Enum: local, microsoft
is_active bool Yes Soft-disable without deleting
pm_client_ids string[] No Client IDs where user is Project Manager
languages string[] No BCP-47 codes (linguist/reviewer competency)
created_at datetime Yes
updated_at datetime No

Indexes: email (unique), role, is_active, created_at (desc), auth_provider


jobs

Field Type Required Notes
_id ObjectId Yes Primary key
org_id string No Owning organisation
client_id string No Owning client
project_id string No Owning project
created_by_user_id string Yes Uploader
status string Yes See Job Status Enum below
source object Yes {filename, gcs_uri, duration_s, language, detected_language}
requested_outputs object Yes {captions_vtt, audio_description_vtt, audio_description_mp3, accessible_video_mp4, languages[], tts_preferences}
tts_preferences object No {provider, default_voice, voices_per_language, model, speed, style_preset}
language_outputs map No Key: BCP-47 lang code → LangOutput object
review object No {notes, reviewer_id, history[]}
language_qc map No Key: BCP-47 lang code → LanguageQCState
failure object No {step, type, message, retriable, occurred_at, retry_count}
created_at datetime Yes
updated_at datetime No

LangOutput sub-document:

Field Type Notes
captions_vtt_gcs string GCS URI
sdh_captions_vtt_gcs string SDH captions (speaker labels, sound effects, music)
ad_vtt_gcs string Audio description VTT
ad_mp3_gcs string Audio description MP3
accessible_video_gcs string Rendered accessible MP4
accessible_video_method string overlay or pause_insert
retimed_captions_vtt_gcs string Re-timed captions for pause-insert
ad_cue_manifest array Per-cue: {cue_index, gcs_uri, text, duration_s}
accessible_video_edit_state object Pause points, video segments, TTS regen queue
descriptive_transcript_gcs string WCAG combined transcript
origin string translate, transcreate, gemini_translate, video_native
qa_notes string QA reviewer notes

LanguageQCState sub-document:

Field Type Notes
status string pending, in_progress, pending_review, in_review, approved, rejected
assigned_linguist_id string
assigned_linguist_email string
assigned_reviewer_id string
assigned_reviewer_email string
linguist_deadline datetime
reviewer_deadline datetime
total_cues int Set when reviewer opens
reviewed_cues int Incremented as cues are reviewed
reject_category string timing, mistranslation, terminology, profanity, length
history array LanguageQCEvent[]
comments array LanguageQCComment[]

Job Status Enum:

Status Meaning
created Job record created, upload pending
ingesting Celery worker downloading/preparing video
ai_processing Gemini 2.5 Pro generating VTT
pending_qc Ready for QC review
approved_english English source VTT approved
approved_source Non-English source VTT approved
rejected Rejected at QC
qc_feedback Returned with QC notes
translating Google Translate running
tts_generating TTS synthesis in progress
tts_failed Legacy; use processing_failed + failure.step="tts" for new failures
rendering_video Accessible video render in progress
render_failed Legacy failure status
processing_failed Unified failure — see failure.step
rendering_qc Re-rendering during QC
pending_final_review PM final sign-off required
completed Fully delivered
cancelled Manually cancelled

Indexes: (status, created_at desc), org_id, client_id, project_id, created_by_user_id


organizations

Field Type Required Notes
_id ObjectId Yes
name string Yes
slug string Yes URL-safe identifier
is_active bool Yes
plan string Yes Default: standard
created_at datetime Yes
updated_at datetime No

memberships

Field Type Required Notes
_id ObjectId Yes
user_id string Yes
organization_id string Yes
role_in_org string Yes owner, admin, manager, member, viewer
team_ids string[] No Teams within the org
created_at datetime No
created_by string No

Indexes: (user_id, organization_id) unique, organization_id, user_id


clients

Field Type Notes
_id ObjectId
name string
slug string
is_active bool
created_at datetime
updated_at datetime

teams

Field Type Notes
_id ObjectId
name string
client_id string FK → clients
member_user_ids string[] User IDs
created_at datetime

projects

Field Type Notes
_id ObjectId
name string
client_id string FK → clients
is_active bool
default_languages string[] BCP-47 language codes
default_linguist_id string
default_reviewer_id string
created_at datetime
updated_at datetime

glossaries

Field Type Notes
_id ObjectId
client_id string FK → clients
name string
description string Optional
source_locale string BCP-47, e.g. en-GB
source string xlsx_upload or fraze_api
status string active, archived
current_version_id string FK → glossary_versions
created_at datetime
created_by string user_id

glossary_versions

Field Type Notes
_id ObjectId
glossary_id string FK → glossaries
version_number int Monotonically increasing
source_xlsx_gcs_path string Original XLSX on GCS
term_count int
embedded_count int Terms with embeddings
embedding_status string pending, in_progress, done, failed
created_at datetime
created_by string user_id
change_note string Optional version note

glossary_terms

Field Type Notes
_id ObjectId
glossary_id string FK → glossaries
version_id string FK → glossary_versions
source term string Source language term
translations map locale → translated string

job_briefs

Field Type Notes
_id ObjectId
organization_id string FK → organizations
project_id string Optional FK → projects
title string
description string Optional
requested_outputs object Same shape as jobs.requested_outputs
languages string[] Requested target languages
deadline datetime Optional
status string draft, submitted, approved, rejected, fulfilled
created_by string user_id
job_id string FK → jobs (set when brief is fulfilled)
created_at datetime
updated_at datetime
submitted_at datetime
approved_by string PM user_id
reject_reason string

vtt_versions

Field Type Notes
_id ObjectId
job_id string FK → jobs
lang string BCP-47 language code
kind string captions or ad
version int Monotonically increasing per (job_id, lang, kind)
content string Full VTT file content
gcs_uri string Backup copy on GCS
created_at datetime
created_by object {user_id, user_email}
note string Editor note
parent_version int Version this was derived from
cue_count int
byte_size int

share_tokens

Field Type Notes
_id string Token itself (32 hex chars) — used as PK
job_id string FK → jobs
organization_id string FK → organizations
created_by_user_id string
created_by_email string Denormalised
created_at datetime
expires_at datetime Null = never expires
is_active bool
label string Human note

invitations

Field Type Notes
_id ObjectId
email string Invitee email
organization_id string FK → organizations
role_in_org string OrgRole enum
target_team_ids string[] Teams to join on accept
token_hash string SHA-256 hash of invitation token
invited_by_user_id string
expires_at datetime
accepted_at datetime Null if pending
revoked_at datetime Null if active
created_at datetime

Indexes: token_hash, organization_id, email, TTL on expires_at


review_notes

Field Type Notes
_id ObjectId
job_id string FK → jobs
asset_key string e.g. en, es, en_accessible
timestamp_seconds float Video timestamp when note was placed
content string Note text
user_id string Author
user_name string Denormalised display name
created_at datetime
updated_at datetime

audit_logs

Field Type Notes
_id ObjectId
action string Dotted enum, e.g. auth.login.success, job.approve
user_id string Actor
target_type string e.g. job, user
target_id string Target resource ID
details object Freeform context
timestamp datetime

Indexes: timestamp (desc), user_id, action, (target_type, target_id)


migrations

Internal collection managed by backend/app/migrations/migrator.py.

Field Type Notes
_id ObjectId
version string Migration script name/timestamp
applied_at datetime

Indexes: version (unique), applied_at (desc)


GCS File Layout

All binary assets are stored in GCS; MongoDB stores only GCS URIs.

gs://{GCS_BUCKET}/jobs/{job_id}/
  source.mp4
  {lang}/
    captions.vtt
    ad.vtt
    ad.mp3
    accessible.mp4
    sdh_captions.vtt
    descriptive_transcript.txt
    cues/
      cue_{n}.mp3

Maintenance

Last Updated: 2026-05-01

Update Triggers:

  • New model file added in backend/app/models/
  • New migration script that adds fields or collections
  • Field added/removed from existing Pydantic model
  • New index created in a migration

Verification:

  • All files in backend/app/models/ represented
  • All indexes from backend/app/migrations/scripts/ reflected
  • Job status enum matches JobStatus in models/job.py
  • ER diagram relationships accurate